-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 2:
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                   32
data parallel size:                                           32
model parallel size:                                          1
batch size per GPU:                                           16

params per GPU:                                               8.08 B
params of model = params per GPU * mp_size:                   8.08 B
fwd MACs per GPU:                                             26.2 TMACs
fwd flops per GPU:                                            52.41 T
fwd flops of model = fwd flops per GPU * mp_size:             52.41 T
fwd latency:                                                  427.35 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:          122.64 TFLOPS
bwd latency:                                                  1.01 s
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:      103.73 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):  109.35 TFLOPS
step latency:                                                 397.53 ms
iter latency:                                                 1.84 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency:         85.67 TFLOPS
samples/second:                                               278.96

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'DiT': '8.08 B'}
    MACs        - {'DiT': '26.2 TMACs'}
    fwd latency - {'DiT': '427.17 ms'}
depth 1:
    params      - {'ModuleList': '8.05 B'}
    MACs        - {'ModuleList': '26.15 TMACs'}
    fwd latency - {'ModuleList': '406.48 ms'}
depth 2:
    params      - {'DiTLayer': '8.05 B'}
    MACs        - {'DiTLayer': '26.15 TMACs'}
    fwd latency - {'DiTLayer': '406.48 ms'}
depth 3:
    params      - {'GemmaMLP': '4.03 B'}
    MACs        - {'GemmaMLP': '16.49 TMACs'}
    fwd latency - {'DiTSelfAttention': '213.37 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
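The derived throughput figures in the summary follow directly from the raw values it reports (fwd flops per GPU, the three latencies, world size, and per-GPU batch size). The short Python sketch below is not part of the profiler output; it simply recomputes those figures as a sanity check under the values printed above, and the small deviations from the reported numbers come from rounding of the displayed latencies.

```python
# Illustrative sanity check: recompute the derived metrics in the profile summary.
# All input values are taken verbatim from the "Profile Summary at step 2" above.

world_size = 32
batch_size_per_gpu = 16
fwd_flops_per_gpu = 52.41e12   # "fwd flops per GPU: 52.41 T"
fwd_latency = 427.35e-3        # seconds
bwd_latency = 1.01             # seconds
iter_latency = 1.84            # seconds (fwd + bwd + step)

# fwd FLOPS per GPU = fwd flops per GPU / fwd latency  (~122.6 TFLOPS)
print(f"fwd FLOPS/GPU:     {fwd_flops_per_gpu / fwd_latency / 1e12:.2f} TFLOPS")

# backward is counted as 2x the forward flops, so
# bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency  (~103.8 TFLOPS)
print(f"bwd FLOPS/GPU:     {2 * fwd_flops_per_gpu / bwd_latency / 1e12:.2f} TFLOPS")

# fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd + bwd latency)  (~109.4 TFLOPS)
print(f"fwd+bwd FLOPS/GPU: {3 * fwd_flops_per_gpu / (fwd_latency + bwd_latency) / 1e12:.2f} TFLOPS")

# FLOPS per GPU = 3 * fwd flops per GPU / iter latency, i.e. including step time  (~85.5 TFLOPS)
print(f"iter FLOPS/GPU:    {3 * fwd_flops_per_gpu / iter_latency / 1e12:.2f} TFLOPS")

# One iteration processes world_size * batch_size_per_gpu samples  (~278 samples/s)
print(f"samples/second:    {world_size * batch_size_per_gpu / iter_latency:.2f}")
```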
DiT( 8.08 B = 100% Params, 26.2 TMACs = 100% MACs, 427.17 ms = 100% latency, 122.7 TFLOPS (layers): ModuleList( (0): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.26 ms = 1.23% latency, 124.34 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 658.51 us = 0.15% latency, 1.22 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 40.77 us = 0.01% latency, 803.74 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 201.94 us = 0.05% latency, 3.99 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 247.24 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.74 ms = 0.64% latency, 87.64 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 467.3 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 276.8 us = 0.06% latency, 248.26 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 161.41 us = 0.04% latency, 106.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.53 us = 0.03% latency, 115.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.38 us = 0.06% latency, 281.2 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 336.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 342.37 us = 0.08% latency, 401.44 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.34 us = 0.07% latency, 433.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 89.65 us = 0.02% latency, 374.3 GFLOPS) ) ) (1): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 616.79 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.75 us = 0.04% latency, 4.74 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.59 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 
0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 260.83 us = 0.06% latency, 263.46 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.33 us = 0.03% latency, 125.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.21 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 305.18 us = 0.07% latency, 450.36 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 83.45 us = 0.02% latency, 402.11 GFLOPS) ) ) (2): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.73 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.92 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.62 us = 0.01% latency, 848.39 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 176.43 us = 0.04% latency, 4.56 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.48 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.88 us = 0.05% 
latency, 300.24 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.97 us = 0.08% latency, 416.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (3): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.7 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 610.83 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.39 us = 0.01% latency, 853.66 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 174.52 us = 0.04% latency, 4.61 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.33 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.23 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 233.17 us = 0.05% latency, 294.71 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.02 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.06 us = 0.08% latency, 418.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (4): DiTLayer( 100.68 M = 1.25% Params, 326.82 
GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.76 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.85 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.28 us = 0.03% latency, 124.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.12 us = 0.08% latency, 413.83 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.14 us = 0.07% latency, 434.74 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (5): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.16 ms = 1.21% latency, 126.76 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 655.41 us = 0.15% latency, 1.23 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.76 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.37 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 259.64 us = 0.06% latency, 264.67 
TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.18 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.67 us = 0.07% latency, 435.39 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (6): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.88 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.05 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.67 us = 0.01% latency, 869.87 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.33 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.85 us = 0.03% latency, 125.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.84 us = 0.05% latency, 298.99 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.66 
us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.18 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (7): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 610.59 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 41.96 us = 0.01% latency, 780.9 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 89.98 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.49 us = 0.06% latency, 272.17 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.21 us = 0.04% latency, 110.69 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.77 us = 0.03% latency, 115.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 338.26 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 335.93 us = 0.08% latency, 409.13 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.49 us = 0.08% latency, 423.56 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.26 us = 0.07% latency, 460.8 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (8): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.36 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.46 us = 0.14% 
latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 32.66 us = 0.01% latency, 1 GFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 156.88 us = 0.04% latency, 5.13 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.7 ms = 0.63% latency, 89.09 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.15 us = 0.06% latency, 275.82 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 140.67 us = 0.03% latency, 122.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.52 us = 0.03% latency, 124.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.38 us = 0.06% latency, 281.2 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 338.26 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 341.65 us = 0.08% latency, 402.28 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 320.2 us = 0.07% latency, 429.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.26 us = 0.07% latency, 460.8 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.78 us = 0.02% latency, 410.31 GFLOPS) ) ) (9): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.17 ms = 1.21% latency, 126.52 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 603.68 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.28 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.72 ms = 0.64% latency, 88.39 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.15 us = 0.06% latency, 275.82 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.16 us = 0.04% latency, 110.01 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 141.14 us = 0.03% latency, 121.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 238.42 us = 0.06% latency, 288.23 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 339.45 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.36 us = 0.08% latency, 407.39 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 319.72 us = 0.07% latency, 429.87 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 300.65 us = 0.07% latency, 457.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (10): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.13 ms = 1.2% latency, 127.47 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 622.51 us = 0.15% latency, 1.29 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.63 us = 0.01% latency, 751.03 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 181.44 us = 0.04% latency, 4.44 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.57 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.85 us = 0.03% latency, 125.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.22 ms = 0.29% latency, 337.47 TFLOPS (gate_proj): Linear(16.78 
M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 334.02 us = 0.08% latency, 411.46 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 320.43 us = 0.08% latency, 428.91 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.5 us = 0.07% latency, 460.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (11): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.3 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 612.26 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 46.73 us = 0.01% latency, 701.22 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.8 us = 0.04% latency, 4.77 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.88 us = 0.04% latency, 109.51 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.53 us = 0.03% latency, 115.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.84 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.4 us = 0.07% latency, 465.26 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.54 us = 0.02% latency, 411.51 GFLOPS) ) ) (12): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.09 ms = 1.19% latency, 128.3 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 614.88 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 40.29 us = 0.01% latency, 813.25 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% 
MACs, 173.33 us = 0.04% latency, 4.65 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.11 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 248.91 us = 0.06% latency, 276.08 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.29 us = 0.03% latency, 119.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.18 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (13): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.23 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 606.06 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 177.15 us = 0.04% latency, 4.55 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.21 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.49 us = 0.06% latency, 272.17 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.79 us = 0.05% latency, 297.76 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 341.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.31 us = 0.08% latency, 412.35 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 319.24 us = 0.07% latency, 430.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (14): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.11 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.95 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.65 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.19 us = 0.06% latency, 282.58 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.74 us = 0.05% latency, 296.53 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.59 us = 0.07% latency, 463.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (15): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.86 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.7 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90.02 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 472.78 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.99 us = 0.06% latency, 283.97 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.5 us = 0.05% latency, 296.84 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.76 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.63 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.43 us = 0.07% latency, 435.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (16): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.13 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.18 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 
MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 259.64 us = 0.06% latency, 264.67 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.04 us = 0.03% latency, 129.14 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.77 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (17): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 127.96 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.42 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.27 us = 0.04% latency, 4.9 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.86 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 238.42 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.52 us = 0.06% latency, 284.53 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 
TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.21 ms = 0.28% latency, 340.52 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.6 us = 0.08% latency, 407.11 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.1 us = 0.07% latency, 433.43 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 83.45 us = 0.02% latency, 402.11 GFLOPS) ) ) (18): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.91 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.53 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.01% latency, 776.49 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.65 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.07 us = 0.05% latency, 298.68 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.21 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 318.29 us = 0.07% latency, 431.81 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (19): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.97 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 597.24 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.63 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.52 us = 0.06% latency, 278.75 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.4 us = 0.05% latency, 300.87 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.12 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (20): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.41 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.73 
TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.07 us = 0.05% latency, 305.33 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.97 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (21): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.1 ms = 1.19% latency, 128.21 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 604.15 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.01% latency, 864.4 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.99 us = 0.04% latency, 4.74 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.86 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 90 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.15 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.5 us = 0.04% latency, 111.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.89 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.3 us = 0.02% latency, 412.72 GFLOPS) ) ) (22): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.6 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.81 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.61 us = 0.04% latency, 4.8 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.35 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.62 us = 0.06% latency, 275.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.94 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.43 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 303.75 us = 0.07% latency, 452.48 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (23): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 611.07 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.01% latency, 875.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.28 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.63% latency, 89.97 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.34 us = 0.06% latency, 274.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.93 us = 0.04% latency, 110.18 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.12 us = 0.05% latency, 299.93 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.55 us = 0.08% latency, 412.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.62 us = 0.07% latency, 434.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (24): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.77 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.14 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.67 ms = 0.62% latency, 90.22 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.92 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 234.13 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.86 us = 0.06% latency, 280.65 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.69 us = 0.04% latency, 110.35 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.83 us = 0.08% latency, 412.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (25): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.9 ms = 1.38% latency, 110.71 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.71 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.75 us = 0.04% latency, 4.89 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 3.43 ms = 0.8% latency, 70.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 541.93 us = 0.13% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 307.56 us = 0.07% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 257.02 us = 0.06% latency, 267.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.21 us = 0.04% latency, 110.69 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.58 us = 0.03% latency, 116.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 385.28 us = 0.09% latency, 44.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 269.17 us = 0.06% latency, 63.82 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 252.72 us = 0.06% latency, 271.92 TFLOPS, 
in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.25 ms = 0.29% latency, 329.75 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 343.32 us = 0.08% latency, 400.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.35 us = 0.07% latency, 463.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 87.74 us = 0.02% latency, 382.44 GFLOPS) ) ) (26): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.12 ms = 1.2% latency, 127.59 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 639.92 us = 0.15% latency, 1.26 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.32 us = 0.04% latency, 4.78 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 281.33 us = 0.07% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.26 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 251.53 us = 0.06% latency, 273.2 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.61 us = 0.03% latency, 125.75 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.22 us = 0.05% latency, 295.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.7 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (27): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 
ms = 1.18% latency, 129.46 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.89 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.17 us = 0.04% latency, 5 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.52 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.3 us = 0.04% latency, 112.06 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.53 us = 0.03% latency, 119.7 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.45 us = 0.05% latency, 302.13 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (28): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.65 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 617.98 us = 0.14% latency, 1.3 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.92 us = 0.01% latency, 683.78 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 173.57 us = 0.04% latency, 4.64 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.73 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.1 us = 0.06% latency, 274.77 TFLOPS, in_features=2048, 
out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.78 us = 0.04% latency, 111.72 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.88 us = 0.05% latency, 306.95 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.53 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 318.29 us = 0.07% latency, 431.81 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (29): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.18 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.35 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.48 us = 0.06% latency, 277.68 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 237.46 us = 0.06% latency, 289.39 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 
FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.73 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.12 us = 0.08% latency, 413.83 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.57 us = 0.07% latency, 432.78 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (30): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.78 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.66 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 947.85 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.05 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.58 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 239.85 us = 0.06% latency, 286.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 142.81 us = 0.03% latency, 120.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 142.81 us = 0.03% latency, 120.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.79 us = 0.05% latency, 297.76 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.64 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.33 us = 0.07% latency, 440.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (31): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.89 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 607.25 us = 0.14% latency, 1.33 TFLOPS 
(silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.01 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 233.41 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144 us = 0.03% latency, 119.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.34 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 326.4 us = 0.08% latency, 421.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (32): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 592.95 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 33.38 us = 0.01% latency, 981.71 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.92 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 155.45 us = 0.04% latency, 110.52 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.42 us = 0.03% latency, 126.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.17 us = 0.05% latency, 301.18 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 347.99 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (33): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.81 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.92 us = 0.01% latency, 763.55 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.99 us = 0.04% latency, 4.88 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.06 us = 0.04% latency, 112.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.92 TFLOPS (gate_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 326.87 us = 0.08% latency, 420.47 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.76 us = 0.07% latency, 438.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (34): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 605.34 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.01% latency, 875.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.23 us = 0.04% latency, 4.73 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.52 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.61 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.82 us = 0.06% latency, 273.98 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.66 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.3 us = 0.08% latency, 418.64 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (35): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.26 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.58 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% 
MACs, 171.18 us = 0.04% latency, 4.7 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.81 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.78 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.29 us = 0.06% latency, 279.02 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.54 us = 0.04% latency, 111.89 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.38 us = 0.03% latency, 125.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.36 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.35 us = 0.08% latency, 419.85 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.92 us = 0.07% latency, 466.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (36): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.83 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.33 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.43 us = 0.06% latency, 282.3 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 149.73 us = 0.04% latency, 114.74 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.83 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.49 us = 0.08% latency, 423.56 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (37): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 128 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.58 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.85 us = 0.04% latency, 4.8 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.68 ms = 0.63% latency, 89.76 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.44 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.6 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 262.74 us = 0.06% latency, 261.55 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.88 us = 0.04% latency, 109.51 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.31 us = 0.05% latency, 298.38 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.34 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 
68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 306.84 us = 0.07% latency, 447.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 81.06 us = 0.02% latency, 413.93 GFLOPS) ) ) (38): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.66 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.33 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.06 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 228.17 us = 0.05% latency, 301.18 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.35 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.59 us = 0.08% latency, 413.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (39): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.53 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.19 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.09 us = 0.04% latency, 4.79 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.52 us = 
0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.01 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.47 us = 0.06% latency, 283.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144 us = 0.03% latency, 119.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.38 us = 0.07% latency, 434.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (40): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.07 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.6 us = 0.04% latency, 4.95 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.26 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.63 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 260.11 us = 0.06% latency, 264.19 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.48 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.35 us = 0.08% latency, 419.85 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (41): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.97 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.91 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.62 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.31 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 251.53 us = 0.06% latency, 273.2 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.97 us = 0.04% latency, 110.86 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.57 us = 0.03% latency, 124.88 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.74 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (42): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.05 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.52 us = 0.04% latency, 4.75 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.42 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.24 us = 0.06% latency, 277.95 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.65 us = 0.03% latency, 131.49 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.7 us = 0.05% latency, 295.32 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.29 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.88 us = 0.08% latency, 414.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 311.85 us = 0.07% latency, 440.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (43): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.11 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 604.63 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 44.82 us = 0.01% latency, 731.06 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.54 TFLOPS (q_norm): 
GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 244.14 us = 0.06% latency, 281.47 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.22 us = 0.05% latency, 295.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.5 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.1 us = 0.07% latency, 433.43 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (44): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.14 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.75 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 342.68 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 341.65 us = 0.08% latency, 402.28 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (45): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.11 ms = 1.2% latency, 127.92 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 650.88 us = 0.15% latency, 1.24 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 947.85 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.65 us = 0.04% latency, 4.98 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 238.18 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.05 us = 0.06% latency, 279.29 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.11 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.43 us = 0.07% latency, 435.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (46): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.87 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.66 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.33 us = 0.01% latency, 954.44 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.56 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.87 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.71 us = 0.06% latency, 283.13 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.07 us = 0.05% latency, 305.33 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.55 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 324.96 us = 0.08% latency, 422.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.42 us = 0.07% latency, 442.75 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (47): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.68 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 614.88 us = 0.14% latency, 1.31 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 50.54 us = 0.01% latency, 648.3 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.57 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 246.52 us = 0.06% latency, 278.75 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.81 us = 0.03% latency, 124.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.49 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.59 us = 0.08% latency, 419.55 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.69 us = 0.07% latency, 466.39 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (48): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.89 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.67 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.55 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.24 us = 0.06% latency, 277.95 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.1 us = 0.03% latency, 116.79 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.18 us = 0.03% latency, 131.97 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.46 us = 0.05% latency, 295.62 
TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 345.01 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.83 us = 0.07% latency, 463.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (49): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.55 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 590.8 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 164.03 us = 0.04% latency, 4.91 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.28 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.88 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.99 us = 0.06% latency, 283.97 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.29 us = 0.03% latency, 119.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 140.67 us = 0.03% latency, 122.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 327.83 us = 0.08% latency, 419.24 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (50): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 
5.03 ms = 1.18% latency, 129.9 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 590.32 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.7 us = 0.04% latency, 4.86 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.19 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.63 ms = 0.62% latency, 91.3 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.43 us = 0.06% latency, 282.3 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.44 us = 0.04% latency, 114.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 321.63 us = 0.08% latency, 427.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (51): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.63 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.57 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.01% latency, 776.49 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.23 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.96 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.56 us = 0.06% latency, 285.66 TFLOPS, in_features=2048, 
out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.96 us = 0.03% latency, 118.52 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.57 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.97 us = 0.08% latency, 416.52 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.24 us = 0.07% latency, 437.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.78 us = 0.07% latency, 469.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (52): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.77 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.46 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.49 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.75 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.19 us = 0.06% latency, 282.58 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.67 us = 0.03% latency, 117.93 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.28 us = 0.03% latency, 128.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.98 us = 0.05% latency, 296.23 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 
FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.29 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 291.11 us = 0.07% latency, 472.12 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (53): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.75 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.85 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.05 us = 0.01% latency, 934.96 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.05 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.02 us = 0.08% latency, 417.73 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.16 us = 0.07% latency, 465.64 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (54): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.66 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 588.42 us = 0.14% latency, 1.37 
TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.17 us = 0.04% latency, 5 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.83 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.56 us = 0.06% latency, 285.66 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.63 us = 0.04% latency, 113.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.03 us = 0.05% latency, 297.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.42 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.81 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (55): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.68 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 585.56 us = 0.14% latency, 1.38 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.36 us = 0.04% latency, 4.96 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.68 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.23 us = 0.06% latency, 283.69 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.34 us = 0.03% latency, 116.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.09 us = 0.03% latency, 125.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.94 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.64 us = 0.07% latency, 464.89 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (56): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.23 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.66 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.15 us = 0.01% latency, 759.33 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.5 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.68 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 253.92 us = 0.06% latency, 270.64 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.94 us = 0.03% latency, 127.31 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.85 us = 0.03% latency, 130.3 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 232.7 us = 0.05% latency, 295.32 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 
68.72 GMACs = 0.26% MACs, 330.92 us = 0.08% latency, 415.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (57): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.57 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.57 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 166.18 us = 0.04% latency, 4.85 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 153.06 us = 0.04% latency, 112.24 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.66 us = 0.03% latency, 126.64 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 224.11 us = 0.05% latency, 306.63 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.74 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.54 us = 0.08% latency, 418.33 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.19 us = 0.07% latency, 436.05 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (58): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.57 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.43 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.95 us = 0.01% latency, 886.7 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 
166.65 us = 0.04% latency, 4.83 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.79 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 234.84 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.09 us = 0.06% latency, 280.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.28 us = 0.03% latency, 128.9 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 237.46 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.85 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.69 us = 0.08% latency, 415.62 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.25 us = 0.07% latency, 468.67 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (59): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.13 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 600.34 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.01% latency, 892.46 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.32 us = 0.04% latency, 4.78 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.14 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.44 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.34 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.39 us = 0.06% latency, 275.55 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 164.03 us = 0.04% latency, 104.73 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, 
out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.71 us = 0.03% latency, 127.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.93 us = 0.05% latency, 301.5 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 238.9 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.15 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.3 us = 0.08% latency, 418.64 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.66 us = 0.07% latency, 442.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 298.5 us = 0.07% latency, 460.43 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (60): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.74 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 594.62 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.22 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.13 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 465.15 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.66 us = 0.06% latency, 282.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.51 us = 0.03% latency, 128.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.71 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% 
Params, 68.72 GMACs = 0.26% MACs, 312.81 us = 0.07% latency, 439.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.49 us = 0.07% latency, 468.29 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (61): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.56 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 595.09 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.47 us = 0.04% latency, 4.72 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.74 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.5 us = 0.04% latency, 111.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.63 us = 0.03% latency, 117.17 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 348.83 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 325.2 us = 0.08% latency, 422.63 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 310.66 us = 0.07% latency, 442.41 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (62): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.94 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 601.53 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 173.57 us = 0.04% latency, 4.64 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.57 
us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.28 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.53 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.7 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 248.19 us = 0.06% latency, 276.88 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.02 us = 0.04% latency, 111.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 156.64 us = 0.04% latency, 109.68 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 138.04 us = 0.03% latency, 124.45 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.99 us = 0.03% latency, 128.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.56 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 291.11 us = 0.07% latency, 472.12 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (63): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.79 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.92 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.39 us = 0.01% latency, 853.66 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 170.95 us = 0.04% latency, 4.71 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.62 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 464.2 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.11 us = 0.04% latency, 112.94 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.39 us = 0.03% latency, 117.36 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, 
in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 227.21 us = 0.05% latency, 302.45 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 345.49 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 332.59 us = 0.08% latency, 413.23 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.97 us = 0.07% latency, 467.53 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) (64): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.19% latency, 129.06 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 599.62 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.51 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.96 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 249.86 us = 0.06% latency, 275.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.73 us = 0.04% latency, 111.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.74 us = 0.05% latency, 303.08 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.56 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.26 us = 0.08% latency, 417.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.45 us = 0.07% latency, 466.77 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (65): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.2 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 593.9 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.33 us = 0.01% latency, 954.44 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 168.8 us = 0.04% latency, 4.77 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.46 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.48 us = 0.06% latency, 277.68 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.83 us = 0.04% latency, 112.41 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.48 us = 0.03% latency, 118.91 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.31 us = 0.05% latency, 305.01 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.04 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.67 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.06 us = 0.08% latency, 418.94 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.04 us = 0.07% latency, 439.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (66): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 129.04 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 603.68 us = 0.14% latency, 1.33 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.01% latency, 910.19 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 172.38 us = 0.04% latency, 4.67 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.76 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 
90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.33 us = 0.06% latency, 280.11 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.47 us = 0.03% latency, 127.76 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 231.27 us = 0.05% latency, 297.14 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.37 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.6 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.73 us = 0.08% latency, 416.82 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 317.34 us = 0.07% latency, 433.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 296.12 us = 0.07% latency, 464.14 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.82 us = 0.02% latency, 415.15 GFLOPS) ) ) (67): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.79 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.13 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.76 us = 0.01% latency, 916.26 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 161.89 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.71 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.11 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.18 us = 0.03% latency, 127.09 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 
TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.6 us = 0.05% latency, 299.3 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.46 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.16 us = 0.08% latency, 415.02 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.47 us = 0.07% latency, 437.04 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (68): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.33 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.52 us = 0.01% latency, 922.41 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 166.65 us = 0.04% latency, 4.83 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.92 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.01 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.76 us = 0.06% latency, 284.25 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.4 us = 0.04% latency, 113.48 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 143.77 us = 0.03% latency, 119.5 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.33 us = 0.03% latency, 125.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 130.41 us = 0.03% latency, 131.73 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 234.84 us = 0.05% latency, 292.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.8 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 335.93 us = 0.08% latency, 409.13 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 316.62 us = 0.07% latency, 434.08 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.25 us = 0.07% latency, 468.67 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): 
PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (69): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.72 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 587.22 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.08 us = 0.04% latency, 4.94 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.96 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.34 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 237.23 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 243.66 us = 0.06% latency, 282.03 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.1 us = 0.03% latency, 116.79 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.32 us = 0.03% latency, 129.83 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.84 us = 0.05% latency, 298.99 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.33 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.04 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314 us = 0.07% latency, 437.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.35 us = 0.02% latency, 417.62 GFLOPS) ) ) (70): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.9 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 612.26 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.01% latency, 904.2 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.04 us = 0.04% latency, 4.76 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.29 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.64 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% 
MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 245.81 us = 0.06% latency, 279.56 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.13 us = 0.03% latency, 131.01 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 234.6 us = 0.05% latency, 292.92 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.61 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 328.78 us = 0.08% latency, 418.03 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.71 us = 0.07% latency, 436.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (71): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.07 ms = 1.19% latency, 128.85 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 608.21 us = 0.14% latency, 1.32 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.92 us = 0.01% latency, 763.55 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 169.52 us = 0.04% latency, 4.75 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 246.05 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.49 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.06 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.08 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.72 us = 0.06% latency, 277.41 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 152.59 us = 0.04% latency, 112.59 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.2 us = 0.03% latency, 118.32 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 137.81 us = 0.03% latency, 124.67 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.6 us = 0.05% latency, 299.3 
TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.95 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.4 us = 0.08% latency, 414.72 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.52 us = 0.07% latency, 438.37 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.73 us = 0.07% latency, 467.91 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (72): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.08 ms = 1.19% latency, 128.72 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 596.28 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.01% latency, 898.29 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 167.37 us = 0.04% latency, 4.81 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.66 ms = 0.62% latency, 90.59 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 466.11 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 250.34 us = 0.06% latency, 274.51 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 154.26 us = 0.04% latency, 111.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.87 us = 0.03% latency, 116.98 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.5 us = 0.05% latency, 303.4 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.39 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 337.12 us = 0.08% latency, 407.68 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (73): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% 
MACs, 5.04 ms = 1.18% latency, 129.67 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 591.75 us = 0.14% latency, 1.36 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.01% latency, 928.64 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.79 us = 0.04% latency, 4.92 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 242.47 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 91.19 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.56 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.95 us = 0.06% latency, 282.86 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.24 us = 0.03% latency, 119.1 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 226.26 us = 0.05% latency, 303.72 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.5 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.45 us = 0.08% latency, 415.92 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 312.57 us = 0.07% latency, 439.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.92 us = 0.07% latency, 466.02 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (74): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.74 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 586.99 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.6 us = 0.04% latency, 4.95 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.43 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.87 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 460.62 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.99 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 241.04 us = 0.06% latency, 285.09 TFLOPS, 
in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.68 us = 0.04% latency, 114.02 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.08 us = 0.03% latency, 130.07 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 225.78 us = 0.05% latency, 304.36 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.53 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 330.21 us = 0.08% latency, 416.22 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.54 us = 0.07% latency, 469.81 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (75): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 129.83 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 583.65 us = 0.14% latency, 1.38 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 162.12 us = 0.04% latency, 4.97 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.9 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.84 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.8 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.09 us = 0.06% latency, 286.23 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 145.44 us = 0.03% latency, 118.13 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 129.46 us = 0.03% latency, 132.7 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.55 us = 0.05% latency, 298.07 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 
0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 347.36 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 329.49 us = 0.08% latency, 417.12 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 314.95 us = 0.07% latency, 436.38 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 292.06 us = 0.07% latency, 470.58 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.63 us = 0.02% latency, 421.37 GFLOPS) ) ) (76): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.05 ms = 1.18% latency, 129.42 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 589.37 us = 0.14% latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.81 us = 0.01% latency, 941.36 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.99 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.65 ms = 0.62% latency, 90.85 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 463.25 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.03 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.95 us = 0.06% latency, 282.86 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 150.92 us = 0.04% latency, 113.84 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 144.72 us = 0.03% latency, 118.71 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 136.14 us = 0.03% latency, 126.2 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.56 us = 0.03% latency, 129.6 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 229.36 us = 0.05% latency, 299.62 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.13 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 343.98 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 333.31 us = 0.08% latency, 412.35 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 295.88 us = 0.07% latency, 464.51 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.11 us = 0.02% latency, 418.86 GFLOPS) ) ) (77): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.03 ms = 1.18% latency, 130.02 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 586.99 us = 0.14% 
latency, 1.37 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 961.11 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.08 us = 0.04% latency, 4.94 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 243.66 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.63 ms = 0.62% latency, 91.36 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 461.82 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 233.41 us = 0.05% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 240.8 us = 0.06% latency, 285.38 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.16 us = 0.04% latency, 113.66 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 146.15 us = 0.03% latency, 117.55 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 133.75 us = 0.03% latency, 128.44 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.37 us = 0.03% latency, 130.78 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 230.07 us = 0.05% latency, 298.68 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.76 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.19 ms = 0.28% latency, 346.25 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 331.64 us = 0.08% latency, 414.42 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 313.28 us = 0.07% latency, 438.71 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 294.21 us = 0.07% latency, 467.15 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.39 us = 0.02% latency, 422.64 GFLOPS) ) ) (78): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.04 ms = 1.18% latency, 129.69 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 599.38 us = 0.14% latency, 1.34 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.63 us = 0.01% latency, 751.03 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 163.32 us = 0.04% latency, 4.93 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 244.38 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.95 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.29 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 236.51 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 242.23 us = 0.06% latency, 283.69 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.06 us = 0.03% latency, 116.03 
TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 148.3 us = 0.03% latency, 115.85 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 134.23 us = 0.03% latency, 127.99 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 132.8 us = 0.03% latency, 129.37 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.88 us = 0.05% latency, 306.95 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 240.09 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.18 ms = 0.28% latency, 349.54 TFLOPS (gate_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 325.44 us = 0.08% latency, 422.32 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 311.61 us = 0.07% latency, 441.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 79.87 us = 0.02% latency, 420.11 GFLOPS) ) ) (79): DiTLayer( 100.68 M = 1.25% Params, 326.82 GMACs = 1.25% MACs, 5.06 ms = 1.18% latency, 129.24 TFLOPS (input_layernorm): AdaLayerNormZero( 25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 598.67 us = 0.14% latency, 1.35 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.15 us = 0.01% latency, 858.99 MFLOPS) (linear): Linear(25.18 M = 0.31% Params, 402.65 MMACs = 0% MACs, 165.46 us = 0.04% latency, 4.87 TFLOPS, in_features=2048, out_features=12288, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 245.09 us = 0.06% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 25.17 M = 0.31% Params, 120.26 GMACs = 0.46% MACs, 2.64 ms = 0.62% latency, 90.97 TFLOPS (q_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 462.77 us = 0.11% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(128 = 0% Params, 0 MACs = 0% MACs, 235.32 us = 0.06% latency, 0 FLOPS) (q_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 247.96 us = 0.06% latency, 277.14 TFLOPS, in_features=2048, out_features=4096, bias=False) (k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 151.87 us = 0.04% latency, 113.12 TFLOPS, in_features=2048, out_features=1024, bias=False) (v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 147.82 us = 0.03% latency, 116.22 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_k_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 135.9 us = 0.03% latency, 126.42 TFLOPS, in_features=2048, out_features=1024, bias=False) (text_v_proj): Linear(2.1 M = 0.03% Params, 8.59 GMACs = 0.03% MACs, 131.61 us = 0.03% latency, 130.54 TFLOPS, in_features=2048, out_features=1024, bias=False) (o_proj): Linear(8.39 M = 0.1% Params, 34.36 GMACs = 0.13% MACs, 223.4 us = 0.05% latency, 307.61 TFLOPS, in_features=4096, out_features=2048, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 239.85 us = 0.06% latency, 0 FLOPS) (mlp): GemmaMLP( 50.33 M = 0.62% Params, 206.16 GMACs = 0.79% MACs, 1.2 ms = 0.28% latency, 344.18 TFLOPS (gate_proj): 
Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 339.27 us = 0.08% latency, 405.1 TFLOPS, in_features=2048, out_features=8192, bias=False) (up_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 315.9 us = 0.07% latency, 435.06 TFLOPS, in_features=2048, out_features=8192, bias=False) (down_proj): Linear(16.78 M = 0.21% Params, 68.72 GMACs = 0.26% MACs, 293.02 us = 0.07% latency, 469.05 TFLOPS, in_features=8192, out_features=2048, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 80.59 us = 0.02% latency, 416.38 GFLOPS) ) ) ) (patch_embed): PatchEmbed( 133.12 K = 0% Params, 536.87 MMACs = 0% MACs, 528.57 us = 0.12% latency, 2.05 TFLOPS (proj): Conv2d(133.12 K = 0% Params, 536.87 MMACs = 0% MACs, 328.54 us = 0.08% latency, 3.29 TFLOPS, 16, 2048, kernel_size=(2, 2), stride=(2, 2)) ) (rotary_emb): GemmaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 s = 0% latency, 0 FLOPS) (time_proj): Timesteps(0 = 0% Params, 0 MACs = 0% MACs, 375.03 us = 0.09% latency, 0 FLOPS) (timestep_embedder): Sequential( 4.72 M = 0.06% Params, 75.5 MMACs = 0% MACs, 568.63 us = 0.13% latency, 265.6 GFLOPS (0): Linear(526.34 K = 0.01% Params, 8.39 MMACs = 0% MACs, 251.29 us = 0.06% latency, 66.76 GFLOPS, in_features=256, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 51.5 us = 0.01% latency, 636.29 MFLOPS) (2): Linear(4.2 M = 0.05% Params, 67.11 MMACs = 0% MACs, 191.21 us = 0.04% latency, 701.93 GFLOPS, in_features=2048, out_features=2048, bias=True) ) (context_embedder): Sequential( 4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 483.51 us = 0.11% latency, 71.06 TFLOPS (0): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 190.26 us = 0.04% latency, 0 FLOPS) (1): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 240.56 us = 0.06% latency, 142.83 TFLOPS, in_features=2048, out_features=2048, bias=True) ) (norm_out): AdaLayerNormOut( 8.39 M = 0.1% Params, 134.22 MMACs = 0% MACs, 700.47 us = 0.16% latency, 383.27 GFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.01% latency, 881.02 MFLOPS) (linear): Linear(8.39 M = 0.1% Params, 134.22 MMACs = 0% MACs, 268.7 us = 0.06% latency, 999.02 GFLOPS, in_features=2048, out_features=4096, bias=True) (norm): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 241.52 us = 0.06% latency, 0 FLOPS) ) (proj_out): Linear(131.14 K = 0% Params, 536.87 MMACs = 0% MACs, 185.01 us = 0.04% latency, 5.8 TFLOPS, in_features=2048, out_features=64, bias=True) (repa_projector): Sequential( 9.97 M = 0.12% Params, 40.8 GMACs = 0.16% MACs, 710.01 us = 0.17% latency, 114.96 TFLOPS (0): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 214.58 us = 0.05% latency, 160.13 TFLOPS, in_features=2048, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 246.04 GFLOPS) (2): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.07% MACs, 175.95 us = 0.04% latency, 195.28 TFLOPS, in_features=2048, out_features=2048, bias=True) (3): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 29.56 us = 0.01% latency, 283.74 GFLOPS) (4): Linear(1.57 M = 0.02% Params, 6.44 GMACs = 0.02% MACs, 158.55 us = 0.04% latency, 81.27 TFLOPS, in_features=2048, out_features=768, bias=True) ) ) ------------------------------------------------------------------------------
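A note on reading the figures above: for every module the profiler reports MACs per forward pass, and the FLOPS column is simply 2 x MACs divided by that module's measured forward latency. Because a Linear layer's MACs equal tokens x in_features x out_features, the same numbers also reveal how many tokens each projection processed. The snippet below is a small sanity check against two of the entries printed above (the gate_proj of layer (65) and the final proj_out); the 2-FLOPs-per-MAC convention and the token-count inference are the only assumptions.

```python
# Sanity-check the profiler's FLOPS figures for two Linear layers listed
# above, assuming the usual convention of 2 FLOPs per MAC.

def linear_tflops(macs, latency_s):
    """FLOPS achieved by a Linear layer: 2 * MACs / latency, in TFLOPS."""
    return 2 * macs / latency_s / 1e12

# (65).mlp.gate_proj: 68.72 GMACs in 328.06 us -> printed as 418.94 TFLOPS
print(linear_tflops(68.72e9, 328.06e-6))   # ~418.9

# proj_out: 536.87 MMACs in 185.01 us -> printed as 5.8 TFLOPS
print(linear_tflops(536.87e6, 185.01e-6))  # ~5.8

# A Linear's MACs are tokens * in_features * out_features, so the token
# count per forward pass can be backed out from any projection:
tokens = 68.72e9 / (2048 * 8192)           # gate_proj maps 2048 -> 8192
print(round(tokens))                        # ~4096 tokens per forward pass
```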
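The per-layer shapes also pin down what each DiTLayer contains: an AdaLayerNormZero whose Linear maps 2048 -> 12288 (12288 = 6 x 2048, consistent with the usual shift/scale/gate modulation), a DiTSelfAttention with six bias-free projections plus 128-parameter RMS norms on the query and key heads, and a GemmaMLP with 2048 -> 8192 gate/up projections and an 8192 -> 2048 down projection around a tanh-approximated GELU. The sketch below reconstructs those containers in PyTorch from the printed shapes; it is an illustration, not the model's actual code. The gated-MLP forward is the standard Gemma/Llama formulation and is assumed, the attention forward (how image and text tokens are mixed, rotary embeddings, head count) is not visible in the profile and is therefore omitted, and nn.RMSNorm (PyTorch >= 2.4) stands in for GemmaRMSNorm.

```python
from torch import nn

class GemmaMLP(nn.Module):
    # Shapes read off the profile: 2048 -> 8192 gate/up, 8192 -> 2048 down,
    # no biases, tanh-approximated GELU ("PytorchGELUTanh"). The gated
    # forward below is the standard Gemma/Llama MLP and is an assumption.
    def __init__(self, hidden=2048, intermediate=8192):
        super().__init__()
        self.gate_proj = nn.Linear(hidden, intermediate, bias=False)
        self.up_proj = nn.Linear(hidden, intermediate, bias=False)
        self.down_proj = nn.Linear(intermediate, hidden, bias=False)
        self.act_fn = nn.GELU(approximate="tanh")

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))


class DiTSelfAttention(nn.Module):
    # Containers only: the attention math itself is not recoverable from the
    # profile, so no forward() is given. The 128-dim RMS norms match the
    # 128-parameter q_norm / k_norm entries above.
    def __init__(self, hidden=2048, q_dim=4096, kv_dim=1024, head_dim=128):
        super().__init__()
        self.q_norm = nn.RMSNorm(head_dim)
        self.k_norm = nn.RMSNorm(head_dim)
        self.q_proj = nn.Linear(hidden, q_dim, bias=False)
        self.k_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.v_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.text_k_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.text_v_proj = nn.Linear(hidden, kv_dim, bias=False)
        self.o_proj = nn.Linear(q_dim, hidden, bias=False)


# Parameter counts match the profile: ~50.33 M for the MLP and ~25.17 M for
# the attention block.
print(sum(p.numel() for p in GemmaMLP().parameters()))          # 50331648
print(sum(p.numel() for p in DiTSelfAttention().parameters()))  # 25166080
```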
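The non-transformer modules listed at the end of the profile are fully specified by their printed shapes: a 2x2, stride-2 Conv2d patch embedding from 16 to 2048 channels, a two-layer timestep MLP (256 -> 2048 -> 2048), an RMS-normed context projection, an AdaLayerNormOut whose Linear maps 2048 -> 4096, a 2048 -> 64 output projection, and a three-layer 2048 -> 2048 -> 2048 -> 768 repa_projector. Below is a container-only PyTorch sketch of those pieces; how they are wired together in forward(), and the parameter-free Timesteps and rotary-embedding modules, are not shown in the profile and are left out. nn.RMSNorm again stands in for GemmaRMSNorm.

```python
from torch import nn

hidden = 2048

# Shapes are read directly off the detailed profile above; the wiring between
# these modules is not shown there and is not reproduced.
patch_embed_proj = nn.Conv2d(16, hidden, kernel_size=2, stride=2)  # 133.12 K params

timestep_embedder = nn.Sequential(                                  # 4.72 M params
    nn.Linear(256, hidden, bias=True),
    nn.SiLU(),
    nn.Linear(hidden, hidden, bias=True),
)

context_embedder = nn.Sequential(                                   # 4.2 M params
    nn.RMSNorm(hidden),
    nn.Linear(hidden, hidden, bias=True),
)

norm_out_linear = nn.Linear(hidden, 2 * hidden, bias=True)          # 8.39 M params
proj_out = nn.Linear(hidden, 64, bias=True)                         # 131.14 K params

repa_projector = nn.Sequential(                                     # 9.97 M params
    nn.Linear(hidden, hidden, bias=True), nn.SiLU(),
    nn.Linear(hidden, hidden, bias=True), nn.SiLU(),
    nn.Linear(hidden, 768, bias=True),
)

modules = (patch_embed_proj, timestep_embedder, context_embedder,
           norm_out_linear, proj_out, repa_projector)
total = sum(p.numel() for m in modules for p in m.parameters())
print(f"{total / 1e6:.2f} M")  # ~27.54 M, roughly the parameters outside the layer stack
```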
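Finally, a per-module breakdown like the one above can be generated with DeepSpeed's flops profiler, either by enabling the flops_profiler section of the DeepSpeed config when training through the engine, or manually as sketched below. The manual path follows the documented FlopsProfiler API (start_profile / stop_profile / print_model_profile / end_profile); `model` and `batch` are placeholders, and argument names may vary between DeepSpeed versions, so treat this as an outline rather than a drop-in recipe.

```python
# Outline of manually producing a per-module MACs/latency breakdown with
# DeepSpeed's flops profiler. `model` and `batch` are assumed to exist.
from deepspeed.profiling.flops_profiler import FlopsProfiler

prof = FlopsProfiler(model)

prof.start_profile()              # attach hooks that count MACs per module
outputs = model(**batch)          # the single forward pass being measured
prof.stop_profile()

prof.print_model_profile(
    profile_step=2,               # label for the step at which profiling ran
    module_depth=-1,              # print every depth of the module tree
    top_modules=1,                # modules to list per depth in the aggregated summary
    detailed=True,                # include the per-module detailed profile
)
prof.end_profile()                # remove hooks and free profiler state
```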