detrex.modeling

backbone

class detrex.modeling.backbone.TimmBackbone(model_name: str, features_only: bool = True, pretrained: bool = False, checkpoint_path: str = '', in_channels: int = 3, out_indices: Tuple[int] = (0, 1, 2, 3), norm_layer: Optional[Module] = None)[source]

A wrapper for using backbones from the timm library. Please see the timm documentation on feature extraction for more details.

Parameters
  • model_name (str) – Name of the timm model to instantiate.

  • features_only (bool) – Whether to extract the feature pyramid (multi-scale feature maps from the deepest layer of each stage). Default: True.

  • pretrained (bool) – Whether to load pretrained weights. Default: False.

  • checkpoint_path (str) – Path of a local checkpoint to load after the model is initialized. Default: '' (no checkpoint is loaded).

  • in_channels (int) – The number of input channels. Default: 3.

  • out_indices (tuple[int]) – The indices of the extracted feature maps, which select specific feature levels or limit the stride of the feature extractor. Default: (0, 1, 2, 3).

  • out_features (tuple[str]) – A map for the output feature dict, e.g., set ("p0", "p1") to return only the features at indices (0, 1) as {"p0": feature at index 0, "p1": feature at index 1}.

  • norm_layer (nn.Module) – Set the specified norm layer for feature extractor, e.g., set norm_layer=FrozenBatchNorm2d to freeze the norm layer in feature extractor.

forward(x)[source]

Forward function of TimmBackbone.

Parameters

x (torch.Tensor) – the input tensor for feature extraction.

Returns

mapping from feature name (e.g., “p1”) to tensor

Return type

dict[str->Tensor]
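
Example (a minimal usage sketch, not part of the documented API; it assumes the timm library is installed and uses a randomly initialized resnet50):

import torch
from detrex.modeling.backbone import TimmBackbone

backbone = TimmBackbone(
    model_name="resnet50",   # any timm model name should work here
    pretrained=False,        # set True to download timm's pretrained weights
    out_indices=(1, 2, 3),   # keep only the three deepest stages
)
features = backbone(torch.randn(1, 3, 224, 224))
for name, feature in features.items():
    print(name, feature.shape)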

class detrex.modeling.backbone.TorchvisionBackbone(model_name: str = 'resnet50', pretrained: bool = False, return_nodes: Dict[str, str] = {'layer1': 'res2', 'layer2': 'res3', 'layer3': 'res4', 'layer4': 'res5'}, train_return_nodes: Optional[Dict[str, str]] = None, eval_return_nodes: Optional[Dict[str, str]] = None, tracer_kwargs: Optional[Dict[str, Any]] = None, suppress_diff_warnings: bool = False, **kwargs)[source]

A wrapper for torchvision pretrained backbones.

Please see the torchvision document Feature extraction for model inspection for more details.

Parameters
  • model_name (str) – Name of torchvision models. Default: resnet50.

  • pretrained (bool) – Whether to load pretrained weights. Default: False.

  • weights (Optional[ResNet50_Weights]) – The pretrained weights to use. Default: None.

  • return_nodes (Dict[str, str]) – The keys are the node names and the values are the user-specified keys for the graph module’s returned dictionary.

forward(x)[source]

Forward function of TorchvisionBackbone

Parameters

x (torch.Tensor) – the input tensor for feature extraction.

Returns

mapping from feature name (e.g., “res2”) to tensor

Return type

dict[str->Tensor]
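
Example (a minimal usage sketch; it uses a randomly initialized resnet50 and the default return_nodes mapping, so the output keys are "res2" through "res5"):

import torch
from detrex.modeling.backbone import TorchvisionBackbone

backbone = TorchvisionBackbone(model_name="resnet50", pretrained=False)
features = backbone(torch.randn(1, 3, 224, 224))
for name, feature in features.items():
    print(name, feature.shape)  # e.g. "res2" -> torch.Size([1, 256, 56, 56])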

class detrex.modeling.backbone.ResNet(stem, stages, num_classes=None, out_features=None, freeze_at=0)[source]

Implement paper Deep Residual Learning for Image Recognition.

Parameters
  • stem (nn.Module) – a stem module.

  • stages (list[list[detectron2.layers.CNNBlockBase]]) – several (typically 4) stages, each contains multiple detectron2.layers.CNNBlockBase.

  • num_classes (None or int) – if None, will not perform classification. Otherwise, will create a linear layer.

  • out_features (list[str]) – name of the layers whose outputs should be returned in forward. Can be anything in “stem”, “linear”, or “res2” … If None, will return the output of the last layer.

  • freeze_at (int) – The number of stages at the beginning to freeze. see freeze() for detailed explanation.

forward(x)[source]
Parameters

x – Tensor of shape (N, C, H, W). H and W must each be a multiple of self.size_divisibility.

Returns

names and the corresponding features

Return type

dict[str->Tensor]

output_shape()[source]
Returns

dict[str->ShapeSpec]

freeze(freeze_at=0)[source]

Freeze the first several stages of the ResNet. Commonly used in fine-tuning. Layers that produce the same feature map spatial size are defined as one “stage” by paper Feature Pyramid Networks for Object Detection.

Parameters

freeze_at (int) – number of stages to freeze. 1 means freezing the stem. 2 means freezing the stem and one residual stage, etc.

Returns

this ResNet itself

Return type

nn.Module

static make_stage(block_class, num_blocks, *, in_channels, out_channels, **kwargs)[source]

Create a list of blocks of the same type that forms one ResNet stage.

Parameters
  • block_class (type) – a subclass of detectron2.layers.CNNBlockBase that’s used to create all blocks in this stage. A module of this type must not change spatial resolution of inputs unless its stride != 1.

  • num_blocks (int) – number of blocks in this stage

  • in_channels (int) – input channels of the entire stage.

  • out_channels (int) – output channels of every block in the stage.

  • kwargs – other arguments passed to the constructor of block_class. If the argument name is “xx_per_block”, the argument is a list of values to be passed to each block in the stage. Otherwise, the same argument is passed to every block in the stage.

Returns

a list of block module.

Return type

list[detectron2.layers.CNNBlockBase]

Examples:

from detrex.modeling.backbone import ResNet
from detectron2.modeling.backbone.resnet import BottleneckBlock

stage = ResNet.make_stage(
    BottleneckBlock, 3, in_channels=16, out_channels=64,
    bottleneck_channels=16, num_groups=1,
    stride_per_block=[2, 1, 1],
    dilations_per_block=[1, 1, 2],
)

Usually, layers that produce the same feature map spatial size are defined as one “stage” (in paper Feature Pyramid Networks for Object Detection). Under such definition, stride_per_block[1:] should all be 1.

static make_default_stages(depth, block_class=None, **kwargs)[source]

Created list of ResNet stages from pre-defined depth (one of 18, 34, 50, 101, 152). If it doesn’t create the ResNet variant you need, please use make_stage() instead for fine-grained customization.

Parameters
  • depth (int) – depth of ResNet

  • block_class (type) – the CNN block class. Has to accept bottleneck_channels argument for depth > 50. By default it is BasicBlock or BottleneckBlock, based on the depth.

  • kwargs – other arguments to pass to make_stage. Should not contain stride and channels, as they are predefined for each depth.

Returns

modules in all stages; see arguments of ResNet.

Return type

list[list[detectron2.layers.CNNBlockBase]]
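
Example (a minimal sketch of building a standard ResNet-50 from the predefined stages; the norm keyword is assumed to be forwarded to the block constructors via kwargs):

from detrex.modeling.backbone import BasicStem, ResNet

resnet50 = ResNet(
    stem=BasicStem(in_channels=3, out_channels=64, norm="FrozenBN"),
    stages=ResNet.make_default_stages(depth=50, norm="FrozenBN"),
    out_features=["res2", "res3", "res4", "res5"],
    freeze_at=1,
)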

detrex.modeling.backbone.make_stage(depth: int = 50, norm: str = 'FrozenBN', num_groups: int = 1, width_per_group: int = 64, in_channels: int = 64, out_channels: int = 256, stride_in_1x1: bool = False, res5_dilation: int = 1, deform_on_per_stage: List[bool] = [False, False, False, False], deform_modulated: bool = False, deform_num_groups: int = 1)[source]

Modified from detectron2.modeling.backbone.build_resnet_backbone

Create a list of blocks of the same type that forms one ResNet stage.

Parameters
  • depth (int) – The depth of ResNet. Default: 50.

  • norm (str or callable) – Normalization for all conv layers. See detectron2.layers.get_norm() for supported format. Default: FrozenBN.

  • num_groups (int) – The number of groups for the 3x3 conv layer. Default: 1.

  • width_per_group (int) – Baseline width of each group. Scaling this parameter will scale the width of all bottleneck layers. Default: 64.

  • in_channels (int) – Output feature channels of the Stem Block. Needs to be set to 64 for R18 and R34. Default: 64.

  • out_channels (int) – Output width of res2. Scaling this parameter will scale the width of all 1x1 convs in ResNet. Default: 256.

  • stride_in_1x1 (bool) – Place the stride 2 conv on the 1x1 filter. Use True only for the original MSRA ResNet; use False for C2 and Torch models. Default: False.

  • res5_dilation (int) – Apply dilation in stage “res5”. Default: 1.

  • deform_on_per_stage (List[bool]) – Apply Deformable Convolution in stages. Specify if apply deform_conv on Res2, Res3, Res4, Res5. Default: [False, False, False, False].

  • deform_modulated (bool) – Use True for modulated deformable convolution (Deformable ConvNets v2, https://arxiv.org/abs/1811.11168); use False for DeformableV1. Default: False.

  • deform_num_groups (int) – Number of groups in deformable conv. Default: 1.

Returns

a list of block module.

Return type

list[detectron2.layers.CNNBlockBase]

Examples:

from detrex.modeling.backbone import make_stage, ResNet, BasicStem

resnet50_dc5 = ResNet(
    stem=BasicStem(in_channels=3, out_channels=64, norm="FrozenBN"),
    stages=make_stage(
        depth=50,
        norm="FrozenBN",
        in_channels=64,
        out_channels=256,
        res5_dilation=2,
    ),
    out_features=["res2", "res3", "res4", "res5"],
    freeze_at=1,
)
class detrex.modeling.backbone.ConvNeXt(in_chans=3, depths=[3, 3, 9, 3], dims=[96, 192, 384, 768], drop_path_rate=0.0, layer_scale_init_value=1e-06, out_indices=(0, 1, 2, 3), frozen_stages=-1)[source]

Implement paper A ConvNet for the 2020s.

Parameters
  • in_chans (int) – Number of input image channels. Default: 3

  • depths (Sequence[int]) – Number of blocks at each stage. Default: [3, 3, 9, 3]

  • dims (List[int]) – Feature dimension at each stage. Default: [96, 192, 384, 768]

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.

  • layer_scale_init_value (float) – Init value for Layer Scale. Default: 1e-6.

  • out_indices (Sequence[int]) – Output from which stages.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.

forward(x)[source]

Forward function of ConvNeXt.

Parameters

x (torch.Tensor) – the input tensor for feature extraction.

Returns

mapping from feature name (e.g., “p1”) to tensor

Return type

dict[str->Tensor]
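
Example (a minimal sketch using the ConvNeXt-T default configuration with random weights):

import torch
from detrex.modeling.backbone import ConvNeXt

backbone = ConvNeXt(depths=[3, 3, 9, 3], dims=[96, 192, 384, 768], out_indices=(0, 1, 2, 3))
features = backbone(torch.randn(1, 3, 224, 224))
for name, feature in features.items():
    print(name, feature.shape)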

class detrex.modeling.backbone.FocalNet(pretrain_img_size=1600, patch_size=4, in_chans=3, embed_dim=96, depths=[2, 2, 6, 2], mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.3, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, patch_norm=True, out_indices=(0, 1, 2, 3), frozen_stages=-1, focal_levels=[3, 3, 3, 3], focal_windows=[3, 3, 3, 3], use_conv_embed=False, use_postln=False, use_postln_in_modulation=False, use_layerscale=False, normalize_modulator=False, use_checkpoint=False)[source]

Implement paper Focal Modulation Networks

Parameters
  • pretrain_img_size (int) – Input image size for training the pretrained model, used in absolute position embedding. Default: 1600.

  • patch_size (int | tuple(int)) – Patch size. Default: 4.

  • in_chans (int) – Number of input image channels. Default: 3.

  • embed_dim (int) – Number of linear projection output channels. Default: 96.

  • depths (tuple[int]) – Depths of each FocalNet stage. Default: [2, 2, 6, 2].

  • mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4.

  • drop_rate (float) – Dropout rate. Default: 0.0.

  • drop_path_rate (float) – Stochastic depth rate. Default: 0.3.

  • norm_layer (nn.Module) – Normalization layer. Default: nn.LayerNorm.

  • patch_norm (bool) – If True, add normalization after patch embedding. Default: True.

  • out_indices (Sequence[int]) – Output from which stages.

  • frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters.

  • focal_levels (Sequence[int]) – Number of focal levels at each of the four stages. Default: [3, 3, 3, 3].

  • focal_windows (Sequence[int]) – Focal window sizes at the first focal level of each of the four stages. Default: [3, 3, 3, 3].

  • use_conv_embed (bool) – Whether to use overlapped convolution for patch embedding. Default: False.

  • use_checkpoint (bool) – Whether to use checkpointing to save memory. Default: False.

forward(x)[source]

Forward function of FocalNet

Parameters

x (torch.Tensor) – the input tensor for feature extraction.

Returns

mapping from feature name (e.g., “p1”) to tensor

Return type

dict[str->Tensor]
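
Example (a minimal sketch using the default configuration with random weights):

import torch
from detrex.modeling.backbone import FocalNet

backbone = FocalNet(embed_dim=96, depths=[2, 2, 6, 2], out_indices=(0, 1, 2, 3))
features = backbone(torch.randn(1, 3, 224, 224))
for name, feature in features.items():
    print(name, feature.shape)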

neck

class detrex.modeling.neck.ChannelMapper(input_shapes: Dict[str, ShapeSpec], in_features: List[str], out_channels: int, kernel_size: int = 3, stride: int = 1, bias: bool = True, groups: int = 1, dilation: int = 1, norm_layer: Optional[Module] = None, activation: Optional[Module] = None, num_outs: Optional[int] = None, **kwargs)[source]

Channel Mapper for reducing or increasing the channels of backbone features. Modified from mmdet.

Parameters
  • input_shapes (Dict[str, ShapeSpec]) – A dict which contains the backbone features' meta information, e.g., input_shapes = {"res5": ShapeSpec(channels=2048)}.

  • in_features (List[str]) – A list containing the keys of the backbone features to be mapped, e.g., in_features = ["res5"].

  • out_channels (int) – Number of output channels for each scale.

  • kernel_size (int, optional) – Size of the convolving kernel for each scale. Default: 3.

  • stride (int, optional) – Stride of convolution for each scale. Default: 1.

  • bias (bool, optional) – If True, adds a learnable bias to the output of each scale. Default: True.

  • groups (int, optional) – Number of blocked connections from input channels to output channels for each scale. Default: 1.

  • dilation (int, optional) – Spacing between kernel elements for each scale. Default: 1.

  • norm_layer (nn.Module, optional) – The norm layer used for each scale. Default: None.

  • activation (nn.Module, optional) – The activation layer used for each scale. Default: None.

  • num_outs (int, optional) – Number of output feature maps. There will be extra_convs when num_outs is larger than the length of in_features. Default: None.

Examples

>>> import torch
>>> import torch.nn as nn
>>> from detrex.modeling import ChannelMapper
>>> from detectron2.modeling import ShapeSpec
>>> input_features = {
... "p0": torch.randn(1, 128, 128, 128),
... "p1": torch.randn(1, 256, 64, 64),
... "p2": torch.randn(1, 512, 32, 32),
... "p3": torch.randn(1, 1024, 16, 16),
... }
>>> input_shapes = {
... "p0": ShapeSpec(channels=128),
... "p1": ShapeSpec(channels=256),
... "p2": ShapeSpec(channels=512),
... "p3": ShapeSpec(channels=1024),
... }
>>> in_features = ["p0", "p1", "p2", "p3"]
>>> neck = ChannelMapper(
... input_shapes=input_shapes,
... in_features=in_features,
... out_channels=256,
... norm_layer=nn.GroupNorm(num_groups=32, num_channels=256),
... )
>>> outputs = neck(input_features)
>>> for i in range(len(outputs)):
... print(f"output[{i}].shape = {outputs[i].shape}")
output[0].shape = torch.Size([1, 256, 128, 128])
output[1].shape = torch.Size([1, 256, 64, 64])
output[2].shape = torch.Size([1, 256, 32, 32])
output[3].shape = torch.Size([1, 256, 16, 16])
forward(inputs)[source]

Forward function for ChannelMapper

Parameters

inputs (Dict[str, torch.Tensor]) – The backbone feature maps.

Returns

A tuple of the processed features.

Return type

tuple(torch.Tensor)

matcher

class detrex.modeling.matcher.HungarianMatcher(cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1, cost_class_type: str = 'focal_loss_cost', alpha: float = 0.25, gamma: float = 2.0)[source]

HungarianMatcher which computes an assignment between targets and predictions.

For efficiency reasons, the targets don’t include the no_object. Because of this, in general, there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions, while the others are un-matched (and thus treated as non-objects).

Parameters
  • cost_class (float) – The relative weight of the classification error in the matching cost. Default: 1.

  • cost_bbox (float) – The relative weight of the L1 error of the bounding box coordinates in the matching cost. Default: 1.

  • cost_giou (float) – This is the relative weight of the giou loss of the bounding box in the matching cost. Default: 1.

  • cost_class_type (str) – How the classification error is calculated. Choose from ["ce_cost", "focal_loss_cost"]. Default: “focal_loss_cost”.

  • alpha (float) – Weighting factor in range (0, 1) to balance positive vs negative examples in focal loss. Default: 0.25.

  • gamma (float) – Exponent of modulating factor (1 - p_t) to balance easy vs hard examples in focal loss. Default: 2.

forward(outputs, targets)[source]

Forward function for HungarianMatcher which performs the matching.

Parameters
  • outputs (Dict[str, torch.Tensor]) –

    This is a dict that contains at least these entries:

    • "pred_logits": Tensor of shape (bs, num_queries, num_classes) with the classification logits.

    • "pred_boxes": Tensor of shape (bs, num_queries, 4) with the predicted box coordinates.

  • targets (List[Dict[str, torch.Tensor]]) –

    This is a list of targets (len(targets) = batch_size), where each target is a dict containing:

    • "labels": Tensor of shape (num_target_boxes, ) (where num_target_boxes is the number of ground-truth objects in the target) containing the class labels. # noqa

    • "boxes": Tensor of shape (num_target_boxes, 4) containing the target box coordinates.

Returns

A list of size batch_size, containing tuples of (index_i, index_j) where:

  • index_i is the indices of the selected predictions (in order)

  • index_j is the indices of the corresponding selected targets (in order)

For each batch element, it holds: len(index_i) = len(index_j) = min(num_queries, num_target_boxes)

Return type

list[tuple[torch.Tensor, torch.Tensor]]
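
Example (an illustrative sketch with random predictions; boxes are assumed to be in the normalized (cx, cy, w, h) format used by DETR-style models):

import torch
from detrex.modeling.matcher import HungarianMatcher

matcher = HungarianMatcher(cost_class=2.0, cost_bbox=5.0, cost_giou=2.0)
outputs = {
    "pred_logits": torch.randn(2, 100, 80),  # (bs, num_queries, num_classes)
    "pred_boxes": torch.rand(2, 100, 4),     # (bs, num_queries, 4), normalized cxcywh
}
targets = [
    {"labels": torch.tensor([3, 17]), "boxes": torch.rand(2, 4)},
    {"labels": torch.tensor([56]), "boxes": torch.rand(1, 4)},
]
indices = matcher(outputs, targets)
# indices[i] is a (prediction_indices, target_indices) pair for the i-th image.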

losses

detrex.modeling.losses.sigmoid_focal_loss(preds, targets, weight=None, alpha: float = 0.25, gamma: float = 2, reduction: str = 'mean', avg_factor: Optional[int] = None)[source]

Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002.

Parameters
  • preds (torch.Tensor) – A float tensor of arbitrary shape. The predictions for each example.

  • targets (torch.Tensor) – A float tensor with the same shape as preds. Stores the binary classification label for each element in preds (0 for the negative class and 1 for the positive class).

  • alpha (float, optional) – Weighting factor in range (0, 1) to balance positive vs negative examples. Default: 0.25.

  • gamma (float) – Exponent of the modulating factor (1 - p_t) to balance easy vs hard examples. Default: 2.

  • reduction (str) – The reduction method: 'none' (no reduction is applied to the output), 'mean' (the output is averaged), or 'sum' (the output is summed). Default: 'mean'.

  • avg_factor (int) – Average factor that is used to average the loss. Default: None.

Returns

The computed sigmoid focal loss with the reduction option applied.

Return type

torch.Tensor
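
Example (a minimal sketch with random logits and one positive label per row):

import torch
from detrex.modeling.losses import sigmoid_focal_loss

preds = torch.randn(4, 80)    # raw logits, one row per prediction
targets = torch.zeros(4, 80)
targets[torch.arange(4), torch.randint(0, 80, (4,))] = 1.0  # one positive class per row
loss = sigmoid_focal_loss(preds, targets, alpha=0.25, gamma=2.0, reduction="mean")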

detrex.modeling.losses.dice_loss(preds, targets, weight=None, eps: float = 0.0001, reduction: str = 'mean', avg_factor: Optional[int] = None)[source]

Compute the DICE loss, similar to generalized IOU for masks

Parameters
  • preds (torch.Tensor) – A float tensor of arbitrary shape. The predictions for each example.

  • targets (torch.Tensor) – A float tensor with the same shape as preds. Stores the binary classification label for each element in preds (0 for the negative class and 1 for the positive class).

  • weight (torch.Tensor, optional) – The weight of loss for each prediction, has a shape (n,). Defaults to None.

  • eps (float) – A small value to avoid dividing by zero. Default: 1e-4.

  • reduction (str) – The reduction method: 'none', 'mean', or 'sum'. Default: 'mean'.

  • avg_factor (int, optional) – Average factor that is used to average the loss. Default: None.

Returns

The computed dice loss.

Return type

torch.Tensor
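
Example (a minimal sketch; preds are assumed here to already be probabilities in [0, 1], one flattened mask per row):

import torch
from detrex.modeling.losses import dice_loss

preds = torch.rand(2, 64 * 64)                    # predicted mask probabilities
targets = (torch.rand(2, 64 * 64) > 0.5).float()  # binary ground-truth masks
loss = dice_loss(preds, targets, eps=1e-4, reduction="mean")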