-------------------------- DeepSpeed Flops Profiler --------------------------
Profile Summary at step 2:
Notations:
data parallel size (dp_size), model parallel size (mp_size),
number of parameters (params), number of multiply-accumulate operations (MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

world size:                                                             32
data parallel size:                                                     32
model parallel size:                                                    1
batch size per GPU:                                                     16
params per GPU:                                                         8.09 B
params of model = params per GPU * mp_size:                             8.09 B
fwd MACs per GPU:                                                       21.91 TMACs
fwd flops per GPU:                                                      43.82 T
fwd flops of model = fwd flops per GPU * mp_size:                       43.82 T
fwd latency:                                                            232.43 ms
fwd FLOPS per GPU = fwd flops per GPU / fwd latency:                    188.55 TFLOPS
bwd latency:                                                            857.36 ms
bwd FLOPS per GPU = 2 * fwd flops per GPU / bwd latency:                102.23 TFLOPS
fwd+bwd FLOPS per GPU = 3 * fwd flops per GPU / (fwd+bwd latency):      120.64 TFLOPS
step latency:                                                           387.81 ms
iter latency:                                                           1.48 s
FLOPS per GPU = 3 * fwd flops per GPU / iter latency:                   88.98 TFLOPS
samples/second:                                                         346.51

----------------------------- Aggregated Profile per GPU -----------------------------
Top 1 modules in terms of params, MACs or fwd latency at different model depths:
depth 0:
    params      - {'DiT': '8.09 B'}
    MACs        - {'DiT': '21.91 TMACs'}
    fwd latency - {'DiT': '232.26 ms'}
depth 1:
    params      - {'ModuleList': '8.02 B'}
    MACs        - {'ModuleList': '21.82 TMACs'}
    fwd latency - {'ModuleList': '221.31 ms'}
depth 2:
    params      - {'DiTLayer': '8.02 B'}
    MACs        - {'DiTLayer': '21.82 TMACs'}
    fwd latency - {'DiTLayer': '221.31 ms'}
depth 3:
    params      - {'GemmaMLP': '3.77 B'}
    MACs        - {'GemmaMLP': '15.46 TMACs'}
    fwd latency - {'DiTSelfAttention': '94.33 ms'}

------------------------------ Detailed Profile per GPU ------------------------------
Each module profile is listed after its name in the following order:
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch, thus it's less than the fwd latency shown above which is captured in DeepSpeed.
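Before the per-module tree below, a minimal Python sketch (not part of the profiler output; variable names are ours) that recomputes the derived quantities in the summary from the raw values it reports. It makes the printed formulas explicit: the factors 2 and 3 encode the usual assumption that the backward pass costs roughly twice the forward pass in flops.

# Recompute the derived metrics of the profile summary above from its raw values.
fwd_flops_per_gpu = 43.82e12      # "fwd flops per GPU"
fwd_latency  = 232.43e-3          # s
bwd_latency  = 857.36e-3          # s
step_latency = 387.81e-3          # s
iter_latency = fwd_latency + bwd_latency + step_latency      # ~1.48 s
world_size, batch_per_gpu = 32, 16

fwd_tflops     = fwd_flops_per_gpu / fwd_latency / 1e12                       # ~188.5 TFLOPS
bwd_tflops     = 2 * fwd_flops_per_gpu / bwd_latency / 1e12                   # ~102.2 TFLOPS
fwd_bwd_tflops = 3 * fwd_flops_per_gpu / (fwd_latency + bwd_latency) / 1e12   # ~120.6 TFLOPS
iter_tflops    = 3 * fwd_flops_per_gpu / iter_latency / 1e12                  # ~89.0 TFLOPS
samples_per_s  = world_size * batch_per_gpu / iter_latency                    # ~346.5
print(fwd_tflops, bwd_tflops, fwd_bwd_tflops, iter_tflops, samples_per_s)

The recomputed values match the summary lines above within rounding, which confirms how the reported FLOPS and samples/second are derived from the per-GPU forward flops and the measured latencies.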
DiT( 8.09 B = 100% Params, 21.91 TMACs = 100% MACs, 232.26 ms = 100% latency, 188.69 TFLOPS (layers): ModuleList( (0): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 7.05 ms = 3.03% latency, 193.51 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 1.01 ms = 0.43% latency, 2.81 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.02% latency, 1.71 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 331.4 us = 0.14% latency, 8.54 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.91 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 336.89 us = 0.15% latency, 358.57 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.69 us = 0.08% latency, 157.54 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.2 us = 0.06% latency, 201.05 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 316.62 us = 0.14% latency, 381.52 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.03 ms = 0.87% latency, 476.37 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.37 us = 0.25% latency, 546.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.51 us = 0.25% latency, 549.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.36 us = 0.23% latency, 592.84 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 111.1 us = 0.05% latency, 377.51 GFLOPS) ) ) (1): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.88 ms = 2.96% latency, 198.14 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.56 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.48 us = 0.02% latency, 1.68 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 235.8 us = 0.1% latency, 12.01 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.41 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 
MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.48 us = 0.14% latency, 378.1 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.97 us = 0.07% latency, 194.87 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.37 us = 0.13% latency, 387.94 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.87% latency, 479.69 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.85 us = 0.25% latency, 546.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.36 us = 0.23% latency, 592.84 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 99.18 us = 0.04% latency, 422.89 GFLOPS) ) ) (2): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.96 ms = 2.99% latency, 196.08 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 926.02 us = 0.4% latency, 3.06 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.92 us = 0.02% latency, 1.28 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 237.46 us = 0.1% latency, 11.92 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 438.21 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.98 ms = 1.28% latency, 132.25 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 447.27 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.98 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 327.83 us = 0.14% latency, 368.48 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.55 us = 0.08% latency, 155.23 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.22 us = 0.07% latency, 188.49 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 159.98 us = 0.07% latency, 188.77 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 315.19 us = 0.14% 
latency, 383.25 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 429.63 us = 0.18% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.41 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.79 us = 0.25% latency, 549.89 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (3): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.61 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 895.5 us = 0.39% latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36 us = 0.02% latency, 1.71 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 225.07 us = 0.1% latency, 12.58 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 436.54 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.78 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 318.53 us = 0.14% latency, 379.23 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.69 us = 0.08% latency, 157.54 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.22 us = 0.07% latency, 188.49 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.4 us = 0.07% latency, 199.47 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 484.16 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.93 us = 0.25% latency, 552.59 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.6 us = 0.25% latency, 555.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (4): DiTLayer( 250.71 M = 3.1% Params, 681.9 
GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.66 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 901.7 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.34 us = 0.02% latency, 1.56 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 228.17 us = 0.1% latency, 12.41 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.97 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.65 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 232.7 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.2 us = 0.14% latency, 377.26 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.83 us = 0.07% latency, 197.6 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.34 us = 0.14% latency, 374.74 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.56 us = 0.25% latency, 550.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (5): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.64 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 912.9 us = 0.39% latency, 3.1 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.68 us = 0.02% latency, 1.44 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.55 us = 0.1% latency, 12.28 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.26% latency, 134.42 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.74 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 325.2 us = 0.14% latency, 371.45 TFLOPS, 
in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 193.83 us = 0.08% latency, 155.8 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.07 us = 0.08% latency, 159.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.54 us = 0.07% latency, 196.68 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.87 us = 0.14% latency, 375.3 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.3 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.26 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.13 us = 0.25% latency, 551.46 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.4 us = 0.23% latency, 593.88 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.65 us = 0.04% latency, 443.13 GFLOPS) ) ) (6): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.42 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 910.76 us = 0.39% latency, 3.11 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 43.39 us = 0.02% latency, 1.42 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 229.12 us = 0.1% latency, 12.36 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.15 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.07 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.49 us = 0.14% latency, 372.27 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.73 us = 0.07% latency, 195.17 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.75 us = 0.13% latency, 391.24 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 429.39 us = 
0.18% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.84 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.89 us = 0.25% latency, 551.69 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 541.21 us = 0.23% latency, 595.19 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (7): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 198.98 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.08 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.67 us = 0.02% latency, 1.63 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 231.27 us = 0.1% latency, 12.24 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.67 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.03 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.43 us = 0.14% latency, 376.98 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.4 us = 0.07% latency, 199.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.77 us = 0.06% latency, 202.99 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.08 us = 0.13% latency, 393.37 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.45 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.02 us = 0.23% latency, 596.5 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.84 us = 0.04% latency, 437.62 GFLOPS) ) ) (8): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.89 ms = 2.97% latency, 197.85 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 896.22 us = 0.39% 
latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 218.63 us = 0.09% latency, 12.95 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.02 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.26% latency, 134.38 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.84 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.1 us = 0.14% latency, 375.02 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.26 us = 0.07% latency, 195.77 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.59 us = 0.07% latency, 197.91 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 314.71 us = 0.14% latency, 383.83 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.23 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.18 us = 0.25% latency, 547.66 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.79 us = 0.25% latency, 549.89 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.88 us = 0.23% latency, 593.36 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (9): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.73 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 900.27 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.45 us = 0.1% latency, 12.45 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.88 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.98 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.74 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 323.3 us = 0.14% latency, 373.64 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.88 us = 0.08% latency, 156.57 TFLOPS, 
in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.35 us = 0.08% latency, 160.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.14 us = 0.13% latency, 388.24 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.02 ms = 0.87% latency, 479.3 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.51 us = 0.25% latency, 549.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 548.84 us = 0.24% latency, 586.92 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.61 us = 0.04% latency, 438.71 GFLOPS) ) ) (10): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.91 ms = 2.97% latency, 197.5 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 921.25 us = 0.4% latency, 3.07 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 38.62 us = 0.02% latency, 1.59 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 241.28 us = 0.1% latency, 11.73 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.5 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.72 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.58 us = 0.14% latency, 374.47 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.02 us = 0.08% latency, 158.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.02 us = 0.07% latency, 196.07 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.87 us = 0.07% latency, 198.84 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.18 us = 0.13% latency, 389.44 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.82 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.09 TFLOPS (gate_proj): Linear(39.32 M 
= 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.7 us = 0.25% latency, 548.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.32 us = 0.25% latency, 550.34 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 538.83 us = 0.23% latency, 597.82 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.37 us = 0.04% latency, 439.8 GFLOPS) ) ) (11): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.89 ms = 2.97% latency, 197.83 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 917.67 us = 0.4% latency, 3.09 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.1 us = 0.02% latency, 1.57 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 235.08 us = 0.1% latency, 12.04 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 437.74 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.53 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.17 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.36 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.4 us = 0.14% latency, 370.09 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 199.32 us = 0.09% latency, 151.51 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.83 us = 0.07% latency, 197.6 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.9 us = 0.13% latency, 388.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.82 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 537.63 us = 0.23% latency, 599.15 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (12): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.93 ms = 2.98% latency, 196.9 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 897.88 us = 0.39% latency, 3.15 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.95 us = 0.02% latency, 1.66 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 
0.01% MACs, 226.02 us = 0.1% latency, 12.53 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.68 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.48 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 446.8 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 232.46 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.39 us = 0.14% latency, 375.86 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.07 us = 0.08% latency, 155.61 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 156.16 us = 0.07% latency, 193.38 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.25 us = 0.14% latency, 372.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.02 ms = 0.87% latency, 478.73 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.66 us = 0.25% latency, 547.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 550.51 us = 0.24% latency, 585.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (13): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.93 ms = 2.98% latency, 196.79 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 926.97 us = 0.4% latency, 3.05 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 50.78 us = 0.02% latency, 1.21 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 236.99 us = 0.1% latency, 11.95 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.87 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.63 us = 0.14% latency, 369.82 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 195.03 us = 0.08% latency, 154.85 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, 
in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.73 us = 0.07% latency, 195.17 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.87 us = 0.07% latency, 198.84 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 310.9 us = 0.13% latency, 388.54 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.61 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.22 us = 0.25% latency, 548.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (14): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.48 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 894.07 us = 0.38% latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.83 us = 0.1% latency, 12.59 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% latency, 134.66 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.7 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.67 us = 0.14% latency, 376.7 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.35 us = 0.07% latency, 198.22 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.49 us = 0.14% latency, 372.27 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.38 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.7 us = 0.25% latency, 548.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): 
Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.08 us = 0.25% latency, 550.56 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.54 us = 0.23% latency, 597.03 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.89 us = 0.04% latency, 442.01 GFLOPS) ) ) (15): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.84 ms = 2.94% latency, 199.42 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 895.98 us = 0.39% latency, 3.16 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.02% latency, 1.64 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 228.4 us = 0.1% latency, 12.4 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.16 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136.02 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.27 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 229.12 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.02 us = 0.08% latency, 158.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 152.11 us = 0.07% latency, 198.53 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.56 us = 0.13% latency, 392.76 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 486.08 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.41 us = 0.25% latency, 552.14 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.12 us = 0.25% latency, 556.23 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (16): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.44 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 908.85 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.07 us = 0.1% latency, 12.31 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 
MACs = 0% MACs, 432.25 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.36 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.65 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.07 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.96 us = 0.14% latency, 377.54 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.45 us = 0.08% latency, 157.74 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.77 us = 0.06% latency, 202.99 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 315.9 us = 0.14% latency, 382.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.98 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.22 us = 0.25% latency, 548.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.03 us = 0.25% latency, 549.67 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.16 us = 0.23% latency, 594.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (17): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.86 ms = 2.95% latency, 198.82 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 901.7 us = 0.39% latency, 3.14 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.91 us = 0.02% latency, 1.62 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.69 us = 0.1% latency, 12.43 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.59 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.35 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.13 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 228.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.96 us = 0.14% latency, 377.54 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.88 us = 0.08% latency, 156.57 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 160.69 us = 0.07% latency, 
187.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.63 us = 0.07% latency, 199.16 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 303.98 us = 0.13% latency, 397.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.3 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.93 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.93 us = 0.25% latency, 552.59 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.16 us = 0.23% latency, 594.14 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (18): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.91 ms = 2.98% latency, 197.24 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 904.56 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.45 us = 0.1% latency, 12.45 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.35 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.94 ms = 1.27% latency, 134.1 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.13 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.17 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.72 us = 0.14% latency, 377.82 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.93 us = 0.08% latency, 157.35 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.3 us = 0.08% latency, 159.53 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.26 us = 0.07% latency, 195.77 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 313.76 us = 0.14% latency, 385 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.25 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.87% latency, 479.86 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.66 us = 0.25% latency, 547.22 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.84 us = 0.25% latency, 550.79 TFLOPS, in_features=3840, out_features=10240, bias=False) 
(down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 547.17 us = 0.24% latency, 588.71 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.13 us = 0.04% latency, 440.91 GFLOPS) ) ) (19): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.9 ms = 2.97% latency, 197.68 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 903.37 us = 0.39% latency, 3.13 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.19 us = 0.02% latency, 1.65 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.07 us = 0.1% latency, 12.31 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.11 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.38 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 233.65 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.91 us = 0.14% latency, 376.42 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 193.6 us = 0.08% latency, 155.99 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.78 us = 0.08% latency, 159.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 156.88 us = 0.07% latency, 192.5 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 155.45 us = 0.07% latency, 194.27 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 316.14 us = 0.14% latency, 382.09 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.01 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.96 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 584.36 us = 0.25% latency, 551.24 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 577.69 us = 0.25% latency, 557.61 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (20): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.87 ms = 2.96% latency, 198.49 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 891.45 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.02% latency, 1.7 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 223.64 us = 0.1% latency, 12.66 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.88 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.93 ms = 1.26% 
latency, 134.84 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.6 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 234.6 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 320.91 us = 0.14% latency, 376.42 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.4 us = 0.08% latency, 156.96 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.06 us = 0.07% latency, 197.3 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 313.76 us = 0.14% latency, 385 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.01 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.32 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.37 us = 0.25% latency, 546.55 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.65 us = 0.25% latency, 551.91 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.3 us = 0.23% latency, 597.29 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 95.84 us = 0.04% latency, 437.62 GFLOPS) ) ) (21): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.98 ms = 3.01% latency, 195.29 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 889.3 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 223.64 us = 0.1% latency, 12.66 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.88 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 3 ms = 1.29% latency, 131.62 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.6 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 234.37 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 318.53 us = 0.14% latency, 379.23 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 200.51 us = 0.09% latency, 150.61 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.4 us = 0.08% latency, 156.96 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 168.32 us = 0.07% latency, 179.41 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 159.26 us = 0.07% latency, 189.62 
TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 324.73 us = 0.14% latency, 371.99 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.04 ms = 0.88% latency, 474.47 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 591.04 us = 0.25% latency, 545.01 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.89 us = 0.25% latency, 551.69 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 555.75 us = 0.24% latency, 579.61 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 97.27 us = 0.04% latency, 431.18 GFLOPS) ) ) (22): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 7.74 ms = 3.33% latency, 176.26 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 928.4 us = 0.4% latency, 3.05 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.2 us = 0.02% latency, 1.46 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 238.42 us = 0.1% latency, 11.87 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.83 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 3.5 ms = 1.51% latency, 112.84 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 236.27 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 326.87 us = 0.14% latency, 369.55 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 194.31 us = 0.08% latency, 155.42 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 168.09 us = 0.07% latency, 179.66 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 164.75 us = 0.07% latency, 183.3 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 412.23 us = 0.18% latency, 293.03 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 446.8 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.15 ms = 0.93% latency, 449.63 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 667.57 us = 0.29% latency, 482.53 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 606.78 us = 0.26% latency, 530.88 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 557.66 us = 0.24% latency, 577.63 TFLOPS, in_features=10240, out_features=3840, bias=False) 
(act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 98.47 us = 0.04% latency, 425.96 GFLOPS) ) ) (23): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.94 ms = 2.99% latency, 196.63 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 939.85 us = 0.4% latency, 3.01 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 42.44 us = 0.02% latency, 1.45 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 245.81 us = 0.11% latency, 11.52 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 439.41 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.95 ms = 1.27% latency, 133.87 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 458.24 us = 0.2% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.79 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 328.78 us = 0.14% latency, 367.41 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.92 us = 0.06% latency, 200.1 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 311.85 us = 0.13% latency, 387.35 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.73 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 484.63 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 581.26 us = 0.25% latency, 554.18 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 578.17 us = 0.25% latency, 557.15 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 546.46 us = 0.24% latency, 589.48 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (24): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 199.11 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 890.02 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.24 us = 0.02% latency, 1.7 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.11 us = 0.1% latency, 12.63 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.16 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.63 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 445.37 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 
MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.48 us = 0.14% latency, 378.1 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.17 us = 0.08% latency, 157.15 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.83 us = 0.08% latency, 159.93 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 148.53 us = 0.06% latency, 203.31 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.44 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.09 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 587.46 us = 0.25% latency, 548.33 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 542.4 us = 0.23% latency, 593.88 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.22 us = 0.04% latency, 449.93 GFLOPS) ) ) (25): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.85 ms = 2.95% latency, 199.09 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 908.37 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 46.49 us = 0.02% latency, 1.32 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 227.69 us = 0.1% latency, 12.43 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 435.59 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 442.98 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.79 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319.24 us = 0.14% latency, 378.38 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.45 us = 0.08% latency, 157.74 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 151.16 us = 0.07% latency, 199.79 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 305.65 us = 0.13% latency, 
395.21 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 484.28 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.99 us = 0.25% latency, 548.77 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.54 us = 0.23% latency, 597.03 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.18 us = 0.04% latency, 445.37 GFLOPS) ) ) (26): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.92 ms = 2.98% latency, 196.97 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 906.23 us = 0.39% latency, 3.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 39.1 us = 0.02% latency, 1.57 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 230.55 us = 0.1% latency, 12.28 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 436.54 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.96 ms = 1.27% latency, 133.41 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.89 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 322.1 us = 0.14% latency, 375.02 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 231.27 us = 0.1% latency, 130.58 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 192.64 us = 0.08% latency, 156.76 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 153.3 us = 0.07% latency, 196.99 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.49 us = 0.06% latency, 202.02 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 303.98 us = 0.13% latency, 397.38 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.54 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2.01 ms = 0.86% latency, 481.17 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.85 us = 0.25% latency, 546.11 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 543.83 us = 0.23% latency, 592.32 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) (27): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 
3.11% MACs, 6.85 ms = 2.95% latency, 199.21 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 893.35 us = 0.38% latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 226.26 us = 0.1% latency, 12.51 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.64 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.92 ms = 1.26% latency, 135.2 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.65 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.5 us = 0.08% latency, 158.53 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 162.84 us = 0.07% latency, 185.45 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 306.61 us = 0.13% latency, 393.98 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.06 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.73 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 583.41 us = 0.25% latency, 552.14 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 579.83 us = 0.25% latency, 555.54 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 539.06 us = 0.23% latency, 597.56 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (28): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.84 ms = 2.95% latency, 199.39 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 887.39 us = 0.38% latency, 3.19 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 1.78 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 222.21 us = 0.1% latency, 12.74 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.92 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 135.85 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 444.41 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.98 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 321.39 us = 0.14% latency, 375.86 TFLOPS, 
in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.59 us = 0.08% latency, 160.13 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.44 us = 0.06% latency, 200.73 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 307.08 us = 0.13% latency, 393.37 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 482.26 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 589.61 us = 0.25% latency, 546.33 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.03 us = 0.25% latency, 549.67 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.02 us = 0.23% latency, 596.5 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (29): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.83 ms = 2.94% latency, 199.75 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 891.69 us = 0.38% latency, 3.18 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 35.29 us = 0.02% latency, 1.74 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 224.83 us = 0.1% latency, 12.59 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 135.92 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.31 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.73 us = 0.08% latency, 158.33 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 189.54 us = 0.08% latency, 159.33 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.68 us = 0.06% latency, 200.42 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.01 us = 0.06% latency, 202.66 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.04 us = 0.13% latency, 392.15 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 431.78 us = 
0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.44 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 586.75 us = 0.25% latency, 549 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.46 us = 0.25% latency, 553.04 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 536.68 us = 0.23% latency, 600.21 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.94 us = 0.04% latency, 446.5 GFLOPS) ) ) (30): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.82 ms = 2.94% latency, 199.89 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 884.29 us = 0.38% latency, 3.2 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.57 us = 0.01% latency, 1.78 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 221.25 us = 0.1% latency, 12.8 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 433.44 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.9 ms = 1.25% latency, 136.03 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.7 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 231.03 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 190.26 us = 0.08% latency, 158.73 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.97 us = 0.06% latency, 201.37 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 149.73 us = 0.06% latency, 201.69 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 305.18 us = 0.13% latency, 395.82 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 2 ms = 0.86% latency, 483.82 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 588.42 us = 0.25% latency, 547.44 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 585.32 us = 0.25% latency, 550.34 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 538.83 us = 0.23% latency, 597.82 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 93.7 us = 0.04% latency, 447.64 GFLOPS) ) ) (31): DiTLayer( 250.71 M = 3.1% Params, 681.9 GMACs = 3.11% MACs, 6.83 ms = 2.94% latency, 199.61 TFLOPS (input_layernorm): AdaLayerNormZero( 88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 894.07 us = 0.38% 
latency, 3.17 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 36.72 us = 0.02% latency, 1.67 GFLOPS) (linear): Linear(88.5 M = 1.09% Params, 1.42 GMACs = 0.01% MACs, 225.31 us = 0.1% latency, 12.57 TFLOPS, in_features=3840, out_features=23040, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 434.4 us = 0.19% latency, 0 FLOPS) ) (self_attn): DiTSelfAttention( 44.24 M = 0.55% Params, 197.3 GMACs = 0.9% MACs, 2.91 ms = 1.25% latency, 135.73 TFLOPS (q_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 443.94 us = 0.19% latency, 0 FLOPS) (k_norm): GemmaRMSNorm(120 = 0% Params, 0 MACs = 0% MACs, 230.55 us = 0.1% latency, 0 FLOPS) (q_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 319 us = 0.14% latency, 378.67 TFLOPS, in_features=3840, out_features=3840, bias=False) (k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 191.21 us = 0.08% latency, 157.93 TFLOPS, in_features=3840, out_features=960, bias=False) (v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 188.11 us = 0.08% latency, 160.54 TFLOPS, in_features=3840, out_features=960, bias=False) (text_k_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 154.5 us = 0.07% latency, 195.47 TFLOPS, in_features=3840, out_features=960, bias=False) (text_v_proj): Linear(3.69 M = 0.05% Params, 15.1 GMACs = 0.07% MACs, 150.2 us = 0.06% latency, 201.05 TFLOPS, in_features=3840, out_features=960, bias=False) (o_proj): Linear(14.75 M = 0.18% Params, 60.4 GMACs = 0.28% MACs, 308.99 us = 0.13% latency, 390.94 TFLOPS, in_features=3840, out_features=3840, bias=False) ) (post_attention_layernorm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 430.11 us = 0.19% latency, 0 FLOPS) (mlp): GemmaMLP( 117.96 M = 1.46% Params, 483.18 GMACs = 2.21% MACs, 1.99 ms = 0.86% latency, 485.21 TFLOPS (gate_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 582.7 us = 0.25% latency, 552.82 TFLOPS, in_features=3840, out_features=10240, bias=False) (up_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 580.55 us = 0.25% latency, 554.86 TFLOPS, in_features=3840, out_features=10240, bias=False) (down_proj): Linear(39.32 M = 0.49% Params, 161.06 GMACs = 0.74% MACs, 540.26 us = 0.23% latency, 596.24 TFLOPS, in_features=10240, out_features=3840, bias=False) (act_fn): PytorchGELUTanh(0 = 0% Params, 0 MACs = 0% MACs, 94.41 us = 0.04% latency, 444.25 GFLOPS) ) ) ) (patch_embed): PatchEmbed( 249.6 K = 0% Params, 1.01 GMACs = 0% MACs, 603.2 us = 0.26% latency, 3.36 TFLOPS (proj): Conv2d(249.6 K = 0% Params, 1.01 GMACs = 0% MACs, 370.74 us = 0.16% latency, 5.47 TFLOPS, 16, 3840, kernel_size=(2, 2), stride=(2, 2)) ) (rotary_emb): GemmaRotaryEmbedding(0 = 0% Params, 0 MACs = 0% MACs, 0 s = 0% latency, 0 FLOPS) (time_proj): Timesteps(0 = 0% Params, 0 MACs = 0% MACs, 242.95 us = 0.1% latency, 0 FLOPS) (timestep_embedder): Sequential( 15.74 M = 0.19% Params, 251.66 MMACs = 0% MACs, 539.06 us = 0.23% latency, 933.8 GFLOPS (0): Linear(986.88 K = 0.01% Params, 15.73 MMACs = 0% MACs, 231.27 us = 0.1% latency, 136.02 GFLOPS, in_features=256, out_features=3840, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 47.68 us = 0.02% latency, 1.29 GFLOPS) (2): Linear(14.75 M = 0.18% Params, 235.93 MMACs = 0% MACs, 186.44 us = 0.08% latency, 2.53 TFLOPS, in_features=3840, out_features=3840, bias=True) ) (context_embedder): Sequential( 7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 479.94 us = 0.21% latency, 134.24 TFLOPS (0): GemmaRMSNorm(2.05 K = 0% Params, 0 MACs = 0% MACs, 177.62 us = 0.08% 
latency, 0 FLOPS) (1): Linear(7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 252.25 us = 0.11% latency, 255.4 TFLOPS, in_features=2048, out_features=3840, bias=True) ) (norm_out): AdaLayerNormOut( 29.5 M = 0.36% Params, 471.86 MMACs = 0% MACs, 845.19 us = 0.36% latency, 1.12 TFLOPS (silu): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 37.43 us = 0.02% latency, 1.64 GFLOPS) (linear): Linear(29.5 M = 0.36% Params, 471.86 MMACs = 0% MACs, 172.62 us = 0.07% latency, 5.47 TFLOPS, in_features=3840, out_features=7680, bias=True) (norm): GemmaRMSNorm(3.84 K = 0% Params, 0 MACs = 0% MACs, 432.97 us = 0.19% latency, 0 FLOPS) ) (proj_out): Linear(245.82 K = 0% Params, 1.01 GMACs = 0% MACs, 176.43 us = 0.08% latency, 11.41 TFLOPS, in_features=3840, out_features=64, bias=True) (repa_projector): Sequential( 13.64 M = 0.17% Params, 55.83 GMACs = 0.25% MACs, 751.73 us = 0.32% latency, 148.57 TFLOPS (0): Linear(7.87 M = 0.1% Params, 32.21 GMACs = 0.15% MACs, 267.27 us = 0.12% latency, 241.05 TFLOPS, in_features=3840, out_features=2048, bias=True) (1): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 34.09 us = 0.01% latency, 246.04 GFLOPS) (2): Linear(4.2 M = 0.05% Params, 17.18 GMACs = 0.08% MACs, 173.81 us = 0.07% latency, 197.69 TFLOPS, in_features=2048, out_features=2048, bias=True) (3): SiLU(0 = 0% Params, 0 MACs = 0% MACs, 28.85 us = 0.01% latency, 290.78 GFLOPS) (4): Linear(1.57 M = 0.02% Params, 6.44 GMACs = 0.03% MACs, 147.82 us = 0.06% latency, 87.17 TFLOPS, in_features=2048, out_features=768, bias=True) ) ) ------------------------------------------------------------------------------
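
The separator above closes the detailed per-GPU profile. As a reference for reproducing this kind of printout, the sketch below shows a DeepSpeed config fragment that enables the flops profiler (the key names are DeepSpeed's documented flops-profiler options; the specific values are illustrative assumptions, not taken from the actual training setup), followed by a quick check of how a per-module FLOPS figure in the tree relates to its listed MACs and forward latency, using the fact that one multiply-accumulate counts as two floating-point operations.

    # Sketch only: a DeepSpeed config fragment that enables the flops profiler.
    # Key names are DeepSpeed's flops-profiler options; the values below are
    # assumptions chosen for illustration, not read from the real run.
    ds_config = {
        "flops_profiler": {
            "enabled": True,       # turn the profiler on
            "profile_step": 2,     # assumption: which training step to profile
            "module_depth": -1,    # profile modules at every depth
            "top_modules": 1,      # how many top modules to aggregate per depth
            "detailed": True,      # print a per-module tree like the one above
            "output_file": None,   # None -> print the report to stdout
        },
    }

    # Sanity check: for a Linear entry, flops ~= 2 * MACs, so the printed
    # per-module FLOPS is roughly 2 * MACs / fwd_latency. Example: the o_proj
    # entry of layer (27) above lists 60.4 GMACs, 306.61 us and 393.98 TFLOPS.
    macs = 60.4e9          # MACs as printed (rounded)
    latency_s = 306.61e-6  # forward latency as printed, in seconds
    tflops = 2 * macs / latency_s / 1e12
    print(f"{tflops:.1f} TFLOPS")  # ~394 TFLOPS; the small gap vs. 393.98 is rounding

Running the same check against any other Linear entry in the tree reproduces its printed TFLOPS to within the rounding of the displayed MACs and latency; zero-FLOPS entries (e.g. GemmaRMSNorm) report latency only because the profiler does not attribute floating-point work to them.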