-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 2:
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                             32
data parallel size:                                                     32
model parallel size:                                                    1
batch size per GPU:                                                     16
params per GPU:                                                         8.09 B
params of model = params per GPU * mp_size:                             8.09 B
fwd MACs per GPU:                                                       21.91 TMACs
fwd flops per GPU:                                                      43.82 T
fwd flops of model = fwd flops per GPU * mp_size:                       43.82 T
fwd latency:                                                            232.43 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:                    188.55 TFLOPS
bwd latency:                                                            857.36 ms
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:                102.23 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):      120.64 TFLOPS
step latency:                                                           387.81 ms
iter latency:                                                           1.48 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency:                   88.98 TFLOPS
samples/second:                                                         346.51

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'DiT': '8.09 B'}
    MACs        - {'DiT': '21.91 TMACs'}
    fwd latency - {'DiT': '232.26 ms'}
depth 1:
    params      - {'ModuleList': '8.02 B'}
    MACs        - {'ModuleList': '21.82 TMACs'}
    fwd latency - {'ModuleList': '221.31 ms'}
depth 2:
    params      - {'DiTLayer': '8.02 B'}
    MACs        - {'DiTLayer': '21.82 TMACs'}
    fwd latency - {'DiTLayer': '221.31 ms'}
depth 3:
    params      - {'GemmaMLP': '3.77 B'}
    MACs        - {'GemmaMLP': '15.46 TMACs'}
    fwd latency - {'DiTSelfAttention': '94.33 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
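Before the per-module tree below, a minimal Python sketch (not part of the profiler output; variable names are ours) that recomputes the derived quantities in the summary from the raw values it reports. It makes the printed formulas explicit: the factors 2 and 3 encode the usual assumption that the backward pass costs roughly twice the forward pass in flops.

# Recompute the derived metrics of the profile summary above from its raw values.
fwd_flops_per_gpu = 43.82e12      # "fwd flops per GPU"
fwd_latency  = 232.43e-3          # s
bwd_latency  = 857.36e-3          # s
step_latency = 387.81e-3          # s
iter_latency = fwd_latency + bwd_latency + step_latency      # ~1.48 s
world_size, batch_per_gpu = 32, 16

fwd_tflops     = fwd_flops_per_gpu / fwd_latency / 1e12                       # ~188.5 TFLOPS
bwd_tflops     = 2 * fwd_flops_per_gpu / bwd_latency / 1e12                   # ~102.2 TFLOPS
fwd_bwd_tflops = 3 * fwd_flops_per_gpu / (fwd_latency + bwd_latency) / 1e12   # ~120.6 TFLOPS
iter_tflops    = 3 * fwd_flops_per_gpu / iter_latency / 1e12                  # ~89.0 TFLOPS
samples_per_s  = world_size * batch_per_gpu / iter_latency                    # ~346.5
print(fwd_tflops, bwd_tflops, fwd_bwd_tflops, iter_tflops, samples_per_s)

The recomputed values match the summary lines above within rounding, which confirms how the reported FLOPS and samples/second are derived from the per-GPU forward flops and the measured latencies.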
DiT( 8.09 B = 100% Params, 21.91 TMACs = 100% MACs, 232.26 ms = 100% latency, 188.69 TFLOPS (layers): ModuleList( (0): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 7.05 ms = 3.03% latency, 193.51 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 1.01 ms = 0.43% latency, 2.81 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.02% latency, 1.71 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 331.4 us = 0.14% latency, 8.54 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.91 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 336.89 us = 0.15% latency, 358.57 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.69 us = 0.08% latency, 157.54 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.2 us = 0.06% latency, 201.05 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 316.62 us = 0.14% latency, 381.52 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.03 ms = 0.87% latency, 476.37 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.37 us = 0.25% latency, 546.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.51 us = 0.25% latency, 549.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.36 us = 0.23% latency, 592.84 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 111.1 us = 0.05% latency, 377.51 GFLOPS) ) ) (1): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.88 ms = 2.96% latency, 198.14 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.56 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.02% latency, 1.68 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 235.8 us = 0.1% latency, 12.01 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.41 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 
MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.48 us = 0.14% latency, 378.1 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.97 us = 0.07% latency, 194.87 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.37 us = 0.13% latency, 387.94 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.87% latency, 479.69 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.85 us = 0.25% latency, 546.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.36 us = 0.23% latency, 592.84 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 99.18 us = 0.04% latency, 422.89 GFLOPS) ) ) (2): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.96 ms = 2.99% latency, 196.08 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 926.02 us = 0.4% latency, 3.06 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.92 us = 0.02% latency, 1.28 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 237.46 us = 0.1% latency, 11.92 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 438.21 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.98 ms = 1.28% latency, 132.25 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 447.27 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.98 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 327.83 us = 0.14% latency, 368.48 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.55 us = 0.08% latency, 155.23 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.22 us = 0.07% latency, 188.49 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 159.98 us = 0.07% latency, 188.77 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 315.19 us = 0.14% 
latency, 383.25 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 429.63 us = 0.18% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.41 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.79 us = 0.25% latency, 549.89 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (3): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.61 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 895.5 us = 0.39% latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.02% latency, 1.71 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 225.07 us = 0.1% latency, 12.58 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 436.54 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.78 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 318.53 us = 0.14% latency, 379.23 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.69 us = 0.08% latency, 157.54 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.22 us = 0.07% latency, 188.49 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.4 us = 0.07% latency, 199.47 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 484.16 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.93 us = 0.25% latency, 552.59 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.6 us = 0.25% latency, 555.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (4): DiTLayer( 250.71 M = 3.1% Params, 681.9 
GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.66 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 901.7 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.34 us = 0.02% latency, 1.56 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 228.17 us = 0.1% latency, 12.41 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.97 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.65 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 232.7 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.2 us = 0.14% latency, 377.26 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.83 us = 0.07% latency, 197.6 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.34 us = 0.14% latency, 374.74 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.56 us = 0.25% latency, 550.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (5): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.64 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 912.9 us = 0.39% latency, 3.1 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.68 us = 0.02% latency, 1.44 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.55 us = 0.1% latency, 12.28 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.26% latency, 134.42 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.74 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 325.2 us = 0.14% latency, 371.45 TFLOPS, 
in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 193.83 us = 0.08% latency, 155.8 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.07 us = 0.08% latency, 159.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.54 us = 0.07% latency, 196.68 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.87 us = 0.14% latency, 375.3 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.3 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.26 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.13 us = 0.25% latency, 551.46 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.4 us = 0.23% latency, 593.88 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.65 us = 0.04% latency, 443.13 GFLOPS) ) ) (6): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.42 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 910.76 us = 0.39% latency, 3.11 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.39 us = 0.02% latency, 1.42 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 229.12 us = 0.1% latency, 12.36 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.15 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.07 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.49 us = 0.14% latency, 372.27 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.73 us = 0.07% latency, 195.17 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.75 us = 0.13% latency, 391.24 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 429.39 us = 
0.18% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.84 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.89 us = 0.25% latency, 551.69 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (7): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 198.98 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.08 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.67 us = 0.02% latency, 1.63 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 231.27 us = 0.1% latency, 12.24 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.67 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.03 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.43 us = 0.14% latency, 376.98 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.4 us = 0.07% latency, 199.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.77 us = 0.06% latency, 202.99 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.08 us = 0.13% latency, 393.37 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.45 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.02 us = 0.23% latency, 596.5 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.84 us = 0.04% latency, 437.62 GFLOPS) ) ) (8): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.89 ms = 2.97% latency, 197.85 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 896.22 us = 0.39% 
latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 218.63 us = 0.09% latency, 12.95 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.02 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.26% latency, 134.38 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.84 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.1 us = 0.14% latency, 375.02 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.26 us = 0.07% latency, 195.77 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.59 us = 0.07% latency, 197.91 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 314.71 us = 0.14% latency, 383.83 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.23 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.18 us = 0.25% latency, 547.66 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.79 us = 0.25% latency, 549.89 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.88 us = 0.23% latency, 593.36 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (9): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.73 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 900.27 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.45 us = 0.1% latency, 12.45 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.88 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.98 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.74 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 323.3 us = 0.14% latency, 373.64 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.88 us = 0.08% latency, 156.57 TFLOPS, 
in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.14 us = 0.13% latency, 388.24 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.02 ms = 0.87% latency, 479.3 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.51 us = 0.25% latency, 549.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 548.84 us = 0.24% latency, 586.92 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.61 us = 0.04% latency, 438.71 GFLOPS) ) ) (10): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.91 ms = 2.97% latency, 197.5 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 921.25 us = 0.4% latency, 3.07 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.62 us = 0.02% latency, 1.59 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 241.28 us = 0.1% latency, 11.73 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.5 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.72 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.58 us = 0.14% latency, 374.47 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.02 us = 0.08% latency, 158.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.02 us = 0.07% latency, 196.07 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.87 us = 0.07% latency, 198.84 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.18 us = 0.13% latency, 389.44 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.82 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.09 TFLOPS (gate_proj): Linear(39.32 M 
= 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.7 us = 0.25% latency, 548.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.32 us = 0.25% latency, 550.34 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 538.83 us = 0.23% latency, 597.82 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.37 us = 0.04% latency, 439.8 GFLOPS) ) ) (11): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.89 ms = 2.97% latency, 197.83 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 917.67 us = 0.4% latency, 3.09 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.1 us = 0.02% latency, 1.57 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 235.08 us = 0.1% latency, 12.04 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.74 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.53 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.4 us = 0.14% latency, 370.09 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 199.32 us = 0.09% latency, 151.51 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.83 us = 0.07% latency, 197.6 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.9 us = 0.13% latency, 388.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.82 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 537.63 us = 0.23% latency, 599.15 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (12): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.93 ms = 2.98% latency, 196.9 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 897.88 us = 0.39% latency, 3.15 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.95 us = 0.02% latency, 1.66 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 
0.01% MACs, 226.02 us = 0.1% latency, 12.53 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.68 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.48 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 446.8 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 232.46 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.39 us = 0.14% latency, 375.86 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.07 us = 0.08% latency, 155.61 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 156.16 us = 0.07% latency, 193.38 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.25 us = 0.14% latency, 372.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.02 ms = 0.87% latency, 478.73 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.66 us = 0.25% latency, 547.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 550.51 us = 0.24% latency, 585.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (13): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.93 ms = 2.98% latency, 196.79 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 926.97 us = 0.4% latency, 3.05 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 50.78 us = 0.02% latency, 1.21 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 236.99 us = 0.1% latency, 11.95 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.87 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.63 us = 0.14% latency, 369.82 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 195.03 us = 0.08% latency, 154.85 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, 
in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.73 us = 0.07% latency, 195.17 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.87 us = 0.07% latency, 198.84 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.9 us = 0.13% latency, 388.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.61 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.22 us = 0.25% latency, 548.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (14): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.48 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 894.07 us = 0.38% latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.83 us = 0.1% latency, 12.59 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.66 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.7 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.67 us = 0.14% latency, 376.7 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.35 us = 0.07% latency, 198.22 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.49 us = 0.14% latency, 372.27 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.38 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.7 us = 0.25% latency, 548.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): 
Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.08 us = 0.25% latency, 550.56 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.54 us = 0.23% latency, 597.03 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.89 us = 0.04% latency, 442.01 GFLOPS) ) ) (15): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.84 ms = 2.94% latency, 199.42 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 895.98 us = 0.39% latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.02% latency, 1.64 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 228.4 us = 0.1% latency, 12.4 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.16 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136.02 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.27 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.12 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.02 us = 0.08% latency, 158.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.11 us = 0.07% latency, 198.53 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.56 us = 0.13% latency, 392.76 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 486.08 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.41 us = 0.25% latency, 552.14 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.12 us = 0.25% latency, 556.23 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (16): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.44 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 908.85 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.07 us = 0.1% latency, 12.31 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 
MACs = 0% MACs, 432.25 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.36 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.65 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.07 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.96 us = 0.14% latency, 377.54 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.45 us = 0.08% latency, 157.74 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.77 us = 0.06% latency, 202.99 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 315.9 us = 0.14% latency, 382.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.98 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.22 us = 0.25% latency, 548.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.03 us = 0.25% latency, 549.67 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.16 us = 0.23% latency, 594.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (17): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.86 ms = 2.95% latency, 198.82 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 901.7 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.69 us = 0.1% latency, 12.43 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.59 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.35 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.13 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 228.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.96 us = 0.14% latency, 377.54 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.88 us = 0.08% latency, 156.57 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.69 us = 0.07% latency, 
187.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.63 us = 0.07% latency, 199.16 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 303.98 us = 0.13% latency, 397.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.3 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.93 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.93 us = 0.25% latency, 552.59 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.16 us = 0.23% latency, 594.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (18): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.91 ms = 2.98% latency, 197.24 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.56 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.45 us = 0.1% latency, 12.45 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.35 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.27% latency, 134.1 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.13 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.72 us = 0.14% latency, 377.82 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.93 us = 0.08% latency, 157.35 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.3 us = 0.08% latency, 159.53 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.26 us = 0.07% latency, 195.77 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 313.76 us = 0.14% latency, 385 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.25 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.87% latency, 479.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.66 us = 0.25% latency, 547.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.84 us = 0.25% latency, 550.79 TFLOPS, in_features=3840, out_features=10240, bias=False) 
(down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 547.17 us = 0.24% latency, 588.71 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (19): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.68 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 903.37 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.07 us = 0.1% latency, 12.31 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.11 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.38 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.65 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.91 us = 0.14% latency, 376.42 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 193.6 us = 0.08% latency, 155.99 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.78 us = 0.08% latency, 159.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 156.88 us = 0.07% latency, 192.5 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 155.45 us = 0.07% latency, 194.27 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 316.14 us = 0.14% latency, 382.09 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.01 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.96 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 577.69 us = 0.25% latency, 557.61 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (20): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.49 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 891.45 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.02% latency, 1.7 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 223.64 us = 0.1% latency, 12.66 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.88 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% 
latency, 134.84 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.6 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 234.6 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.91 us = 0.14% latency, 376.42 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.4 us = 0.08% latency, 156.96 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 313.76 us = 0.14% latency, 385 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.01 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.32 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.37 us = 0.25% latency, 546.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.84 us = 0.04% latency, 437.62 GFLOPS) ) ) (21): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.98 ms = 3.01% latency, 195.29 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 889.3 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 223.64 us = 0.1% latency, 12.66 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.88 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 3 ms = 1.29% latency, 131.62 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.6 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 234.37 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 318.53 us = 0.14% latency, 379.23 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 200.51 us = 0.09% latency, 150.61 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.4 us = 0.08% latency, 156.96 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 168.32 us = 0.07% latency, 179.41 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 159.26 us = 0.07% latency, 189.62 
TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.73 us = 0.14% latency, 371.99 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.04 ms = 0.88% latency, 474.47 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 591.04 us = 0.25% latency, 545.01 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.89 us = 0.25% latency, 551.69 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 555.75 us = 0.24% latency, 579.61 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 97.27 us = 0.04% latency, 431.18 GFLOPS) ) ) (22): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 7.74 ms = 3.33% latency, 176.26 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 928.4 us = 0.4% latency, 3.05 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.02% latency, 1.46 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 238.42 us = 0.1% latency, 11.87 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 3.5 ms = 1.51% latency, 112.84 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.87 us = 0.14% latency, 369.55 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.31 us = 0.08% latency, 155.42 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 168.09 us = 0.07% latency, 179.66 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 164.75 us = 0.07% latency, 183.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 412.23 us = 0.18% latency, 293.03 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 446.8 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.15 ms = 0.93% latency, 449.63 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 667.57 us = 0.29% latency, 482.53 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 606.78 us = 0.26% latency, 530.88 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 557.66 us = 0.24% latency, 577.63 TFLOPS, in_features=10240, out_features=3840, bias=False) 
(act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 98.47 us = 0.04% latency, 425.96 GFLOPS) ) ) (23): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.94 ms = 2.99% latency, 196.63 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 939.85 us = 0.4% latency, 3.01 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.44 us = 0.02% latency, 1.45 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 245.81 us = 0.11% latency, 11.52 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 439.41 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.87 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 458.24 us = 0.2% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.79 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 328.78 us = 0.14% latency, 367.41 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.85 us = 0.13% latency, 387.35 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.63 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 581.26 us = 0.25% latency, 554.18 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 578.17 us = 0.25% latency, 557.15 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 546.46 us = 0.24% latency, 589.48 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (24): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 199.11 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 890.02 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.02% latency, 1.7 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.11 us = 0.1% latency, 12.63 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.16 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.63 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.37 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 
MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.48 us = 0.14% latency, 378.1 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.53 us = 0.06% latency, 203.31 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.44 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.09 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.46 us = 0.25% latency, 548.33 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.4 us = 0.23% latency, 593.88 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.22 us = 0.04% latency, 449.93 GFLOPS) ) ) (25): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 199.09 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 908.37 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 46.49 us = 0.02% latency, 1.32 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.69 us = 0.1% latency, 12.43 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.59 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.98 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.79 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.24 us = 0.14% latency, 378.38 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.45 us = 0.08% latency, 157.74 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.16 us = 0.07% latency, 199.79 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 305.65 us = 0.13% latency, 
395.21 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 484.28 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.54 us = 0.23% latency, 597.03 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (26): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.92 ms = 2.98% latency, 196.97 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 906.23 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.1 us = 0.02% latency, 1.57 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.55 us = 0.1% latency, 12.28 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 436.54 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.41 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.1 us = 0.14% latency, 375.02 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 231.27 us = 0.1% latency, 130.58 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.64 us = 0.08% latency, 156.76 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.3 us = 0.07% latency, 196.99 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.49 us = 0.06% latency, 202.02 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 303.98 us = 0.13% latency, 397.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.17 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.85 us = 0.25% latency, 546.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.83 us = 0.23% latency, 592.32 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (27): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 
3.11% MACs, 6.85 ms = 2.95% latency, 199.21 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 893.35 us = 0.38% latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 226.26 us = 0.1% latency, 12.51 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.2 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.65 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.5 us = 0.08% latency, 158.53 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 162.84 us = 0.07% latency, 185.45 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 306.61 us = 0.13% latency, 393.98 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.06 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.73 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.41 us = 0.25% latency, 552.14 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.06 us = 0.23% latency, 597.56 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (28): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.84 ms = 2.95% latency, 199.39 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 887.39 us = 0.38% latency, 3.19 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 1.78 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 222.21 us = 0.1% latency, 12.74 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 135.85 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.98 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.39 us = 0.14% latency, 375.86 TFLOPS, 
in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.08 us = 0.13% latency, 393.37 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.26 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.61 us = 0.25% latency, 546.33 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.03 us = 0.25% latency, 549.67 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.02 us = 0.23% latency, 596.5 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (29): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.83 ms = 2.94% latency, 199.75 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 891.69 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.02% latency, 1.74 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.83 us = 0.1% latency, 12.59 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 135.92 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.54 us = 0.08% latency, 159.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 
0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.44 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (30): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.82 ms = 2.94% latency, 199.89 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 884.29 us = 0.38% latency, 3.2 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 1.78 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 221.25 us = 0.1% latency, 12.8 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.44 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136.03 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.7 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.03 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 305.18 us = 0.13% latency, 395.82 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.82 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.42 us = 0.25% latency, 547.44 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.32 us = 0.25% latency, 550.34 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 538.83 us = 0.23% latency, 597.82 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (31): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.83 ms = 2.94% latency, 199.61 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 894.07 us = 0.38% 
latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 225.31 us = 0.1% latency, 12.57 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.73 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.2 us = 0.06% latency, 201.05 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.99 us = 0.13% latency, 390.94 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.21 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 580.55 us = 0.25% latency, 554.86 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.26 us = 0.23% latency, 596.24 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) ) (patch_embed): PatchEmbed( 249.6 K = 0% Params, 1.01 GMACs = 0% MACs, 603.2 us = 0.26% latency, 3.36 TFLOPS (proj): Conv2d(249.6 K = 0% Params, 1.01 GMACs = 0% MACs, 370.74 us = 0.16% latency, 5.47 TFLOPS, 16, 3840, kernel_size=(2, 2), stride=(2, 2)) ) (rotary_emb): GemmaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 s = 0% latency, 0 FLOPS) (time_proj): Timesteps(0 = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.1% latency, 0 FLOPS) (timestep_embedder): Sequential( 15.74 M = 0.19% Params, 251.66 MMACs = 0% MACs, 539.06 us = 0.23% latency, 933.8 GFLOPS (0): Linear(986.88 K = 0.01% Params, 15.73 MMACs = 0% MACs, 231.27 us = 0.1% latency, 136.02 GFLOPS, in_features=256, out_features=3840, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.68 us = 0.02% latency, 1.29 GFLOPS) (2): Linear(14.75 M = 0.18% Params, 235.93 MMACs = 0% MACs, 186.44 us = 0.08% latency, 2.53 TFLOPS, in_features=3840, out_features=3840, bias=True) ) (context_embedder): Sequential( 7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 479.94 us = 0.21% latency, 134.24 TFLOPS (0): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 177.62 us = 0.08% 
latency, 0 FLOPS) (1): Linear(7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 252.25 us = 0.11% latency, 255.4 TFLOPS, in_features=2048, out_features=3840, bias=True) ) (norm_out): AdaLayerNormOut( 29.5 M = 0.36% Params, 471.86 MMACs = 0% MACs, 845.19 us = 0.36% latency, 1.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.02% latency, 1.64 GFLOPS) (linear): Linear(29.5 M = 0.36% Params, 471.86 MMACs = 0% MACs, 172.62 us = 0.07% latency, 5.47 TFLOPS, in_features=3840, out_features=7680, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.97 us = 0.19% latency, 0 FLOPS) ) (proj_out): Linear(245.82 K = 0% Params, 1.01 GMACs = 0% MACs, 176.43 us = 0.08% latency, 11.41 TFLOPS, in_features=3840, out_features=64, bias=True) (repa_projector): Sequential( 13.64 M = 0.17% Params, 55.83 GMACs = 0.25% MACs, 751.73 us = 0.32% latency, 148.57 TFLOPS (0): Linear(7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 267.27 us = 0.12% latency, 241.05 TFLOPS, in_features=3840, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 246.04 GFLOPS) (2): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.08% MACs, 173.81 us = 0.07% latency, 197.69 TFLOPS, in_features=2048, out_features=2048, bias=True) (3): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 28.85 us = 0.01% latency, 290.78 GFLOPS) (4): Linear(1.57 M = 0.02% Params, 6.44 GMACs = 0.03% MACs, 147.82 us = 0.06% latency, 87.17 TFLOPS, in_features=2048, out_features=768, bias=True) ) ) ------------------------------------------------------------------------------
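
The separator above closes the detailed per-GPU profile. As a reference for reproducing this kind of printout, the sketch below shows a DeepSpeed config fragment that enables the flops profiler (the key names are DeepSpeed's documented flops-profiler options; the specific values are illustrative assumptions, not taken from the actual training setup), followed by a quick check of how a per-module FLOPS figure in the tree relates to its listed MACs and forward latency, using the fact that one multiply-accumulate counts as two floating-point operations.

    # Sketch only: a DeepSpeed config fragment that enables the flops profiler.
    # Key names are DeepSpeed's flops-profiler options; the values below are
    # assumptions chosen for illustration, not read from the real run.
    ds_config = {
        "flops_profiler": {
            "enabled": True,       # turn the profiler on
            "profile_step": 2,     # assumption: which training step to profile
            "module_depth": -1,    # profile modules at every depth
            "top_modules": 1,      # how many top modules to aggregate per depth
            "detailed": True,      # print a per-module tree like the one above
            "output_file": None,   # None -> print the report to stdout
        },
    }

    # Sanity check: for a Linear entry, flops ~= 2 * MACs, so the printed
    # per-module FLOPS is roughly 2 * MACs / fwd_latency. Example: the o_proj
    # entry of layer (27) above lists 60.4 GMACs, 306.61 us and 393.98 TFLOPS.
    macs = 60.4e9          # MACs as printed (rounded)
    latency_s = 306.61e-6  # forward latency as printed, in seconds
    tflops = 2 * macs / latency_s / 1e12
    print(f"{tflops:.1f} TFLOPS")  # ~394 TFLOPS; the small gap vs. 393.98 is rounding

Running the same check against any other Linear entry in the tree reproduces its printed TFLOPS to within the rounding of the displayed MACs and latency; zero-FLOPS entries (e.g. GemmaRMSNorm) report latency only because the profiler does not attribute floating-point work to them.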