detrex.modeling
backbone
- class detrex.modeling.backbone.TimmBackbone(model_name: str, features_only: bool = True, pretrained: bool = False, checkpoint_path: str = '', in_channels: int = 3, out_indices: Tuple[int] = (0, 1, 2, 3), norm_layer: Optional[Module] = None)[source]
A wrapper for using backbones from the timm library. Please see the timm documentation on feature extraction for more details.
- Parameters
model_name (str) – Name of the timm model to instantiate.
features_only (bool) – Whether to extract a feature pyramid (multi-scale feature maps from the deepest layer of each stage). Default: True.
pretrained (bool) – Whether to load pretrained weights. Default: False.
checkpoint_path (str) – Path of a checkpoint to load after the model is initialized. Default: ''.
in_channels (int) – The number of input channels. Default: 3.
out_indices (tuple[int]) – The extracted feature indices, which select specific feature levels or limit the stride of the feature extractor. Default: (0, 1, 2, 3).
out_features (tuple[str]) – A map for the output feature dict, e.g., set ("p0", "p1") to return only the features from indices (0, 1) as {"p0": feature from index 0, "p1": feature from index 1}.
norm_layer (nn.Module) – Set the specified norm layer for the feature extractor, e.g., set norm_layer=FrozenBatchNorm2d to freeze the norm layers in the feature extractor.
- class detrex.modeling.backbone.TorchvisionBackbone(model_name: str = 'resnet50', pretrained: bool = False, return_nodes: Dict[str, str] = {'layer1': 'res2', 'layer2': 'res3', 'layer3': 'res4', 'layer4': 'res5'}, train_return_nodes: Optional[Dict[str, str]] = None, eval_return_nodes: Optional[Dict[str, str]] = None, tracer_kwargs: Optional[Dict[str, Any]] = None, suppress_diff_warnings: bool = False, **kwargs)[source]
A wrapper for torchvision pretrained backbones.
Please see the torchvision documentation on feature extraction for model inspection for more details.
- Parameters
model_name (str) – Name of torchvision models. Default: resnet50.
pretrained (bool) – Whether to load pretrained weights. Default: False.
weights (Optional[ResNet50_Weights]) – The pretrained weights to use. Default: None.
return_nodes (Dict[str, str]) – The keys are the node names and the values are the user-specified keys for the graph module’s returned dictionary.
- class detrex.modeling.backbone.ResNet(stem, stages, num_classes=None, out_features=None, freeze_at=0)[source]
Implement paper Deep Residual Learning for Image Recognition.
- Parameters
stem (nn.Module) – a stem module.
stages (list[list[detectron2.layers.CNNBlockBase]]) – several (typically 4) stages, each contains multiple detectron2.layers.CNNBlockBase.
num_classes (None or int) – if None, will not perform classification. Otherwise, will create a linear layer.
out_features (list[str]) – name of the layers whose outputs should be returned in forward. Can be anything in “stem”, “linear”, or “res2” … If None, will return the output of the last layer.
freeze_at (int) – The number of stages at the beginning to freeze. See freeze() for detailed explanation.
- forward(x)[source]
- Parameters
x – Tensor of shape (N,C,H,W). H, W must be a multiple of self.size_divisibility.
- Returns
names and the corresponding features
- Return type
dict[str->Tensor]
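The divisibility requirement on H and W can be illustrated with a small helper that rounds an image size up to the next valid size. This is a minimal sketch in plain Python; the function name `pad_to_divisible` is hypothetical and not part of detrex or detectron2.

```python
from typing import Tuple


def pad_to_divisible(height: int, width: int, size_divisibility: int) -> Tuple[int, int]:
    """Round (height, width) up to the nearest multiples of size_divisibility.

    Hypothetical helper for illustration only; it just computes the target size.
    """
    if size_divisibility <= 1:
        # No constraint: any size is acceptable.
        return height, width
    # Ceiling division, then scale back up to a multiple.
    padded_h = -(-height // size_divisibility) * size_divisibility
    padded_w = -(-width // size_divisibility) * size_divisibility
    return padded_h, padded_w


print(pad_to_divisible(800, 1333, 32))  # (800, 1344)
```

An input would then be padded (for example with zeros) to the returned size before being passed to forward.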
- freeze(freeze_at=0)[source]
Freeze the first several stages of the ResNet. Commonly used in fine-tuning. Layers that produce the same feature map spatial size are defined as one “stage” by paper Feature Pyramid Networks for Object Detection.
- Parameters
freeze_at (int) – number of stages to freeze. 1 means freezing the stem. 2 means freezing the stem and one residual stage, etc.
- Returns
this ResNet itself
- Return type
nn.Module
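The meaning of freeze_at can be sketched as a lookup over the ordered stage names. This is an illustration only: the helper `frozen_stage_names` is hypothetical, and the default names assume the usual "stem", "res2" … "res5" layout of a four-stage ResNet.

```python
from typing import List, Sequence


def frozen_stage_names(
    freeze_at: int,
    stage_names: Sequence[str] = ("stem", "res2", "res3", "res4", "res5"),
) -> List[str]:
    """Return the names of the parts that freeze_at would freeze.

    Hypothetical helper: 0 freezes nothing, 1 freezes the stem,
    2 freezes the stem and the first residual stage, and so on.
    """
    if freeze_at < 0:
        raise ValueError("freeze_at must be non-negative")
    return list(stage_names[:freeze_at])


print(frozen_stage_names(0))  # []
print(frozen_stage_names(2))  # ['stem', 'res2']
```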
- static make_stage(block_class, num_blocks, *, in_channels, out_channels, **kwargs)[source]
Create a list of blocks of the same type that forms one ResNet stage.
- Parameters
block_class (type) – a subclass of detectron2.layers.CNNBlockBase that’s used to create all blocks in this stage. A module of this type must not change spatial resolution of inputs unless its stride != 1.
num_blocks (int) – number of blocks in this stage
in_channels (int) – input channels of the entire stage.
out_channels (int) – output channels of every block in the stage.
kwargs – other arguments passed to the constructor of block_class. If the argument name is “xx_per_block”, the argument is a list of values to be passed to each block in the stage. Otherwise, the same argument is passed to every block in the stage.
- Returns
a list of block modules.
- Return type
list[detectron2.layers.CNNBlockBase]
Examples:
    stage = ResNet.make_stage(
        BottleneckBlock, 3, in_channels=16, out_channels=64,
        bottleneck_channels=16, num_groups=1,
        stride_per_block=[2, 1, 1],
        dilations_per_block=[1, 1, 2]
    )
Usually, layers that produce the same feature map spatial size are defined as one “stage” (in paper Feature Pyramid Networks for Object Detection). Under such definition, stride_per_block[1:] should all be 1.
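The "xx_per_block" convention described for kwargs can be sketched as follows. This is a simplified, standalone illustration of the rule as stated above, not the library's code; `expand_block_kwargs` is a hypothetical name.

```python
from typing import Any, Dict, List

_SUFFIX = "_per_block"


def expand_block_kwargs(num_blocks: int, **kwargs: Any) -> List[Dict[str, Any]]:
    """Split stage-level kwargs into one kwargs dict per block.

    Arguments named "xx_per_block" must be lists of length num_blocks and are
    distributed as "xx"; every other argument is copied to all blocks.
    """
    per_block: List[Dict[str, Any]] = []
    for i in range(num_blocks):
        block_kwargs: Dict[str, Any] = {}
        for name, value in kwargs.items():
            if name.endswith(_SUFFIX):
                if len(value) != num_blocks:
                    raise ValueError(f"{name} must have {num_blocks} entries")
                block_kwargs[name[: -len(_SUFFIX)]] = value[i]
            else:
                block_kwargs[name] = value
        per_block.append(block_kwargs)
    return per_block


blocks = expand_block_kwargs(3, bottleneck_channels=16, stride_per_block=[2, 1, 1])
print(blocks[0])  # {'bottleneck_channels': 16, 'stride': 2}
print(blocks[1])  # {'bottleneck_channels': 16, 'stride': 1}
```

Only the first block receives stride 2 here, which matches the convention that stride_per_block[1:] is all ones within one stage.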
- static make_default_stages(depth, block_class=None, **kwargs)[source]
Create a list of ResNet stages from a pre-defined depth (one of 18, 34, 50, 101, 152). If it doesn’t create the ResNet variant you need, please use make_stage() instead for fine-grained customization.
- Parameters
depth (int) – depth of ResNet
block_class (type) – the CNN block class. Has to accept bottleneck_channels argument for depth >= 50. By default it is BasicBlock (depth 18 and 34) or BottleneckBlock (depth 50 and above), based on the depth.
kwargs – other arguments to pass to make_stage. Should not contain stride and channels, as they are predefined for each depth.
- Returns
modules in all stages; see arguments of ResNet.
- Return type
list[list[detectron2.layers.CNNBlockBase]]
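For orientation, the pre-defined depths correspond to the standard ResNet stage layouts from the original paper. The sketch below tabulates them in plain Python; `default_stage_plan` is a hypothetical helper, and the assumption is that the library follows these standard configurations.

```python
from typing import Dict, List

# Blocks per stage (res2..res5) for the standard ResNet depths.
NUM_BLOCKS_PER_STAGE: Dict[int, List[int]] = {
    18: [2, 2, 2, 2],
    34: [3, 4, 6, 3],
    50: [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
}


def default_stage_plan(depth: int) -> List[dict]:
    """Describe each stage of a standard ResNet of the given depth."""
    if depth not in NUM_BLOCKS_PER_STAGE:
        raise ValueError(f"unsupported depth: {depth}")
    basic = depth in (18, 34)  # BasicBlock variants use narrower stages
    in_channels = [64, 64, 128, 256] if basic else [64, 256, 512, 1024]
    out_channels = [64, 128, 256, 512] if basic else [256, 512, 1024, 2048]
    first_strides = [1, 2, 2, 2]  # only the first block of a stage downsamples
    names = ["res2", "res3", "res4", "res5"]
    return [
        {"name": n, "num_blocks": b, "in_channels": i, "out_channels": o, "first_stride": s}
        for n, b, i, o, s in zip(
            names, NUM_BLOCKS_PER_STAGE[depth], in_channels, out_channels, first_strides
        )
    ]


for stage in default_stage_plan(50):
    print(stage)
```

Because strides and channels are fixed per depth in this way, they should not be passed again through kwargs.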
- detrex.modeling.backbone.make_stage(depth: int = 50, norm: float = 'FrozenBN', num_groups: int = 1, width_per_group: int = 64, in_channels: int = 64, out_channels: int = 256, stride_in_1x1: bool = False, res5_dilation: int = 1, deform_on_per_stage: List[bool] = [False, False, False, False], deform_modulated: bool = False, deform_num_groups: int = 1)[source]
Modified from detectron2.modeling.backbone.build_resnet_backbone.
Create the list of ResNet stages for the given depth, to be passed as the stages argument of ResNet.
- Parameters
depth (int) – The depth of ResNet. Default: 50.
norm (str or callable) – Normalization for all conv layers. See detectron2.layers.get_norm() for supported format. Default: FrozenBN.
num_groups (int) – The number of groups for the 3x3 conv layer. Default: 1.
width_per_group (int) – Baseline width of each group. Scaling this parameter will scale the width of all bottleneck layers. Default: 64.
in_channels (int) – Output feature channels of the Stem Block. Needs to be set to 64 for R18 and R34. Default: 64.
out_channels (int) – Output width of res2. Scaling this parameter will scale the width of all 1x1 convs in ResNet. Default: 256.
stride_in_1x1 (bool) – Place the stride 2 conv on the 1x1 filter. Use True only for the original MSRA ResNet; use False for C2 and Torch models. Default: False.
res5_dilation (int) – Apply dilation in stage “res5”. Default: 1.
deform_on_per_stage (List[bool]) – Apply Deformable Convolution in stages. Specify if apply deform_conv on Res2, Res3, Res4, Res5. Default: [False, False, False, False].
deform_modulated (bool) – Use True to use modulated deform_conv (DeformableV2, https://arxiv.org/abs/1811.11168); use False for DeformableV1. Default: False.
deform_num_groups (int) – Number of groups in deformable conv. Default: 1.
- Returns
a list of stages, each a list of block modules; see the stages argument of ResNet.
- Return type
list[list[detectron2.layers.CNNBlockBase]]
Examples:
    from detrex.modeling.backbone import make_stage, ResNet, BasicStem

    resnet50_dc5 = ResNet(
        stem=BasicStem(in_channels=3, out_channels=64, norm="FrozenBN"),
        stages=make_stage(
            depth=50,
            norm="FrozenBN",
            in_channels=64,
            out_channels=256,
            res5_dilation=2,
        ),
        out_features=["res2", "res3", "res4", "res5"],
        freeze_at=1,
    )
- class detrex.modeling.backbone.ConvNeXt(in_chans=3, depths=[3, 3, 9, 3], dims=[96, 192, 384, 768], drop_path_rate=0.0, layer_scale_init_value=1e-06, out_indices=(0, 1, 2, 3), frozen_stages=-1)[source]
Implement paper A ConvNet for the 2020s.
- Parameters
in_chans (int) – Number of input image channels. Default: 3
depths (Sequence[int]) – Number of blocks at each stage. Default: [3, 3, 9, 3]
dims (List[int]) – Feature dimension at each stage. Default: [96, 192, 384, 768]
drop_path_rate (float) – Stochastic depth rate. Default: 0.
layer_scale_init_value (float) – Init value for Layer Scale. Default: 1e-6.
out_indices (Sequence[int]) – Output from which stages.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters. Default: -1.
- class detrex.modeling.backbone.FocalNet(pretrain_img_size=1600, patch_size=4, in_chans=3, embed_dim=96, depths=[2, 2, 6, 2], mlp_ratio=4.0, drop_rate=0.0, drop_path_rate=0.3, norm_layer=<class 'torch.nn.modules.normalization.LayerNorm'>, patch_norm=True, out_indices=(0, 1, 2, 3), frozen_stages=-1, focal_levels=[3, 3, 3, 3], focal_windows=[3, 3, 3, 3], use_conv_embed=False, use_postln=False, use_postln_in_modulation=False, use_layerscale=False, normalize_modulator=False, use_checkpoint=False)[source]
Implement paper Focal Modulation Networks
- Parameters
pretrain_img_size (int) – Input image size for training the pretrained model, used in absolute position embedding. Default: 1600.
patch_size (int | tuple(int)) – Patch size. Default: 4.
in_chans (int) – Number of input image channels. Default: 3.
embed_dim (int) – Number of linear projection output channels. Default: 96.
depths (tuple[int]) – Depths of each FocalNet stage. Default: [2, 2, 6, 2].
mlp_ratio (float) – Ratio of mlp hidden dim to embedding dim. Default: 4.
drop_rate (float) – Dropout rate. Default: 0.0.
drop_path_rate (float) – Stochastic depth rate. Default: 0.3.
norm_layer (nn.Module) – Normalization layer. Default: nn.LayerNorm.
patch_norm (bool) – If True, add normalization after patch embedding. Default: True.
out_indices (Sequence[int]) – Output from which stages.
frozen_stages (int) – Stages to be frozen (stop grad and set eval mode). -1 means not freezing any parameters.
focal_levels (Sequence[int]) – Number of focal levels at four stages
focal_windows (Sequence[int]) – Focal window sizes at first focal level at four stages
use_conv_embed (bool) – Whether use overlapped convolution for patch embedding
use_checkpoint (bool) – Whether to use checkpointing to save memory. Default: False.
neck
- class detrex.modeling.neck.ChannelMapper(input_shapes: Dict[str, ShapeSpec], in_features: List[str], out_channels: int, kernel_size: int = 3, stride: int = 1, bias: bool = True, groups: int = 1, dilation: int = 1, norm_layer: Optional[Module] = None, activation: Optional[Module] = None, num_outs: Optional[int] = None, **kwargs)[source]
Channel Mapper for reducing or increasing the channels of backbone features. Modified from mmdet.
- Parameters
input_shapes (Dict[str, ShapeSpec]) – A dict which contains the backbone features' meta information, e.g. input_shapes = {"res5": ShapeSpec(channels=2048)}.
in_features (List[str]) – A list containing the keys which map to the features output from the backbone, e.g. in_features = ["res"].
out_channels (int) – Number of output channels for each scale.
kernel_size (int, optional) – Size of the convolving kernel for each scale. Default: 3.
stride (int, optional) – Stride of convolution for each scale. Default: 1.
bias (bool, optional) – If True, adds a learnable bias to the output of each scale. Default: True.
groups (int, optional) – Number of blocked connections from input channels to output channels for each scale. Default: 1.
dilation (int, optional) – Spacing between kernel elements for each scale. Default: 1.
norm_layer (nn.Module, optional) – The norm layer used for each scale. Default: None.
activation (nn.Module, optional) – The activation layer used for each scale. Default: None.
num_outs (int, optional) – Number of output feature maps. There will be extra_convs when num_outs is larger than the length of in_features. Default: None.
Examples
>>> import torch
>>> import torch.nn as nn
>>> from detrex.modeling import ChannelMapper
>>> from detectron2.modeling import ShapeSpec
>>> input_features = {
...     "p0": torch.randn(1, 128, 128, 128),
...     "p1": torch.randn(1, 256, 64, 64),
...     "p2": torch.randn(1, 512, 32, 32),
...     "p3": torch.randn(1, 1024, 16, 16),
... }
>>> input_shapes = {
...     "p0": ShapeSpec(channels=128),
...     "p1": ShapeSpec(channels=256),
...     "p2": ShapeSpec(channels=512),
...     "p3": ShapeSpec(channels=1024),
... }
>>> in_features = ["p0", "p1", "p2", "p3"]
>>> neck = ChannelMapper(
...     input_shapes=input_shapes,
...     in_features=in_features,
...     out_channels=256,
...     norm_layer=nn.GroupNorm(num_groups=32, num_channels=256),
... )
>>> outputs = neck(input_features)
>>> for i in range(len(outputs)):
...     print(f"output[{i}].shape = {outputs[i].shape}")
output[0].shape = torch.Size([1, 256, 128, 128])
output[1].shape = torch.Size([1, 256, 64, 64])
output[2].shape = torch.Size([1, 256, 32, 32])
output[3].shape = torch.Size([1, 256, 16, 16])
matcher
- class detrex.modeling.matcher.HungarianMatcher(cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1, cost_class_type: str = 'focal_loss_cost', alpha: float = 0.25, gamma: float = 2.0)[source]
HungarianMatcher which computes an assignment between targets and predictions.
For efficiency reasons, the targets don’t include the no_object. Because of this, in general, there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions, while the others are un-matched (and thus treated as non-objects).
- Parameters
cost_class (float) – The relative weight of the classification error in the matching cost. Default: 1.
cost_bbox (float) – The relative weight of the L1 error of the bounding box coordinates in the matching cost. Default: 1.
cost_giou (float) – This is the relative weight of the giou loss of the bounding box in the matching cost. Default: 1.
cost_class_type (str) – How the classification error is calculated. Choose from ["ce_cost", "focal_loss_cost"]. Default: "focal_loss_cost".
alpha (float) – Weighting factor in range (0, 1) to balance positive vs negative examples in focal loss. Default: 0.25.
gamma (float) – Exponent of modulating factor (1 - p_t) to balance easy vs hard examples in focal loss. Default: 2.
- forward(outputs, targets)[source]
Forward function for HungarianMatcher which performs the matching.
- Parameters
outputs (Dict[str, torch.Tensor]) – This is a dict that contains at least these entries:
"pred_logits": Tensor of shape (bs, num_queries, num_classes) with the classification logits.
"pred_boxes": Tensor of shape (bs, num_queries, 4) with the predicted box coordinates.
targets (List[Dict[str, torch.Tensor]]) – This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
"labels": Tensor of shape (num_target_boxes, ) (where num_target_boxes is the number of ground-truth objects in the target) containing the class labels.
"boxes": Tensor of shape (num_target_boxes, 4) containing the target box coordinates.
- Returns
A list of size batch_size, containing tuples of (index_i, index_j) where:
index_i is the indices of the selected predictions (in order)
index_j is the indices of the corresponding selected targets (in order)
For each batch element, it holds: len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
- Return type
list[torch.Tensor]
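The 1-to-1 assignment for a single image can be illustrated with a tiny brute-force search over a cost matrix. This is a sketch only: `match_one_image` is a hypothetical name, the cost values are made up, and practical implementations typically rely on a Hungarian solver such as scipy.optimize.linear_sum_assignment rather than enumerating permutations.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def match_one_image(cost: Sequence[Sequence[float]]) -> Tuple[List[int], List[int]]:
    """Find the minimum-cost 1-to-1 assignment of predictions to targets.

    cost[q][t] is the matching cost between prediction q and target t.
    Assumes num_queries >= num_targets, the common case described above.
    Brute force: only suitable for very small examples.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if num_queries else 0
    if num_targets == 0:
        return [], []
    if num_queries < num_targets:
        raise ValueError("this sketch assumes at least as many queries as targets")

    best_total = float("inf")
    best_queries: Tuple[int, ...] = ()
    # queries[t] is the prediction assigned to target t.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if total < best_total:
            best_total, best_queries = total, queries

    # Report pairs ordered by prediction index.
    pairs = sorted(zip(best_queries, range(num_targets)))
    index_i = [q for q, _ in pairs]
    index_j = [t for _, t in pairs]
    return index_i, index_j


# 4 predictions, 2 targets: two predictions stay unmatched.
cost = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.50, 0.60],
    [0.05, 0.70],
]
print(match_one_image(cost))  # ([0, 3], [1, 0])
```

Here prediction 0 is matched to target 1 and prediction 3 to target 0; predictions 1 and 2 are left unmatched and would be treated as non-objects.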
losses
- detrex.modeling.losses.sigmoid_focal_loss(preds, targets, weight=None, alpha: float = 0.25, gamma: float = 2, reduction: str = 'mean', avg_factor: Optional[int] = None)[source]
Loss used in RetinaNet for dense detection: https://arxiv.org/abs/1708.02002.
- Parameters
preds (torch.Tensor) – A float tensor of arbitrary shape. The predictions for each example.
targets (torch.Tensor) – A float tensor with the same shape as preds. Stores the binary classification label for each element in preds (0 for the negative class and 1 for the positive class).
alpha (float, optional) – Weighting factor in range (0, 1) to balance positive vs negative examples. Default: 0.25.
gamma (float) – Exponent of the modulating factor (1 - p_t) to balance easy vs hard examples. Default: 2.
reduction (str) – 'none' | 'mean' | 'sum'. 'none': No reduction will be applied to the output. 'mean': The output will be averaged. 'sum': The output will be summed. Default: 'mean'.
avg_factor (int) – Average factor that is used to average the loss. Default: None.
- Returns
The computed sigmoid focal loss with the reduction option applied.
- Return type
torch.Tensor
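The roles of alpha and gamma can be seen in an element-wise sketch of the focal loss from the RetinaNet paper, written in plain Python. It assumes the predictions are raw logits, ignores weight and avg_factor, and is not numerically stabilized; `sigmoid_focal_loss_sketch` is a hypothetical name, not the library function.

```python
import math
from typing import List, Sequence, Union


def sigmoid_focal_loss_sketch(
    logits: Sequence[float],
    targets: Sequence[float],
    alpha: float = 0.25,
    gamma: float = 2.0,
    reduction: str = "mean",
) -> Union[float, List[float]]:
    """Focal loss over flat lists of logits and binary targets (0.0 or 1.0)."""
    losses = []
    for x, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-x))  # sigmoid
        ce = -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))  # binary cross-entropy
        p_t = p * t + (1.0 - p) * (1.0 - t)  # probability of the true class
        loss = ce * (1.0 - p_t) ** gamma  # down-weight easy examples
        if alpha >= 0:
            alpha_t = alpha * t + (1.0 - alpha) * (1.0 - t)  # class balancing
            loss = alpha_t * loss
        losses.append(loss)
    if reduction == "none":
        return losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction}")


easy = sigmoid_focal_loss_sketch([4.0], [1.0], reduction="sum")
hard = sigmoid_focal_loss_sketch([-4.0], [1.0], reduction="sum")
print(easy < hard)  # True: the confidently correct example contributes far less
```

With gamma = 0 the modulating factor disappears and the result is alpha-weighted binary cross-entropy.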
- detrex.modeling.losses.dice_loss(preds, targets, weight=None, eps: float = 0.0001, reduction: str = 'mean', avg_factor: Optional[int] = None)[source]
Compute the DICE loss, similar to generalized IoU for masks.
- Parameters
preds (torch.Tensor) – A float tensor of arbitrary shape. The predictions for each example.
targets (torch.Tensor) – A float tensor with the same shape as preds. Stores the binary classification label for each element in preds (0 for the negative class and 1 for the positive class).
weight (torch.Tensor, optional) – The weight of loss for each prediction, has a shape (n,). Defaults to None.
eps (float) – Avoid dividing by zero. Default: 1e-4.
avg_factor (int, optional) – Average factor that is used to average the loss. Default: None.
- Returns
The computed dice loss.
- Return type
torch.Tensor
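A common formulation of the dice loss for one mask is sketched below in plain Python. It assumes the predictions are already probabilities in [0, 1] and uses eps in both numerator and denominator; the exact smoothing and normalization used by the library may differ, and `dice_loss_sketch` is a hypothetical name.

```python
from typing import Sequence


def dice_loss_sketch(
    probs: Sequence[float], targets: Sequence[float], eps: float = 1e-4
) -> float:
    """1 - Dice coefficient between a predicted mask and a binary target mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    # eps keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


print(dice_loss_sketch([1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]))  # 0.0
print(dice_loss_sketch([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]) > 0.99)  # True
```

Perfect overlap gives a loss of 0, while disjoint masks give a loss close to 1.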