detrex.layers

class detrex.layers.BaseTransformerLayer(attn: List[Module], ffn: Module, norm: Module, operation_order: Optional[tuple] = None)[source]

The implementation of Base TransformerLayer used in Transformer. Modified from mmcv.

It can be built by directly passing the attention, FFN, and norm modules, which allows more flexible customization when combined with the LazyConfig system. BaseTransformerLayer also supports pre-norm when norm is specified as the first element of operation_order. For more details about pre-norm, see "On Layer Normalization in the Transformer Architecture".

Parameters
  • attn (list[nn.Module] | nn.Module) – An nn.Module, or a list of nn.Module, containing the attention module(s) used in TransformerLayer.

  • ffn (nn.Module) – FFN module used in TransformerLayer.

  • norm (nn.Module) – Normalization layer used in TransformerLayer.

  • operation_order (tuple[str]) – The execution order of operations in the transformer layer, such as ('self_attn', 'norm', 'ffn', 'norm'). Pre-norm is used when the first element is 'norm'. Default: None.
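
A minimal sketch of how an operation_order tuple might be interpreted, assuming a leading 'norm' means pre-norm and that the attention entries determine how many attention modules (and attention masks) are needed. The helper summarize_operation_order and the 'cross_attn' entry name are assumptions made for illustration; they are not taken from the documentation above.

```python
def summarize_operation_order(operation_order):
    """Hypothetical helper: inspect an operation_order tuple."""
    # 'cross_attn' is an assumed name for cross-attention entries.
    allowed = {"self_attn", "cross_attn", "norm", "ffn"}
    unknown = set(operation_order) - allowed
    if unknown:
        raise ValueError(f"unsupported operations: {sorted(unknown)}")
    return {
        # Pre-norm: normalization runs before the first attention/FFN.
        "pre_norm": operation_order[0] == "norm",
        "num_attn": sum(op in ("self_attn", "cross_attn") for op in operation_order),
        "num_ffn": operation_order.count("ffn"),
        "num_norm": operation_order.count("norm"),
    }


post_norm = summarize_operation_order(("self_attn", "norm", "ffn", "norm"))
pre_norm = summarize_operation_order(("norm", "self_attn", "norm", "ffn"))
```

Here post_norm["pre_norm"] is False and pre_norm["pre_norm"] is True, while both orders contain one attention operation.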

forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, query_pos: Optional[Tensor] = None, key_pos: Optional[Tensor] = None, attn_masks: Optional[List[Tensor]] = None, query_key_padding_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None, **kwargs)[source]

Forward function for BaseTransformerLayer.

**kwargs contains the specific arguments of attentions.

Parameters
  • query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) or (bs, num_query, embed_dim), depending on the layout expected by the attention module used in BaseTransformerLayer.

  • key (torch.Tensor) – Key embeddings used in Attention.

  • value (torch.Tensor) – Value embeddings with the same shape as key.

  • query_pos (torch.Tensor) – The position embedding for query. Default: None.

  • key_pos (torch.Tensor) – The position embedding for key. Default: None.

  • attn_masks (List[Tensor] | None) – A list of 2D ByteTensor masks, one for each attention operation. The length of attn_masks should equal the number of attention operations in operation_order. Default: None.

  • query_key_padding_mask (torch.Tensor) – ByteTensor for query, with shape (bs, num_query). Only used in the self_attn layer. Default: None.

  • key_padding_mask (torch.Tensor) – ByteTensor for key, with shape (bs, num_key). Default: None.

class detrex.layers.ConditionalCrossAttention(embed_dim, num_heads, attn_drop=0.0, proj_drop=0.0, batch_first=False, **kwargs)[source]

Conditional Cross-Attention Module used in Conditional-DETR

For details, see the paper "Conditional DETR for Fast Training Convergence".

Parameters
  • embed_dim (int) – The embedding dimension for attention.

  • num_heads (int) – The number of attention heads.

  • attn_drop (float) – Dropout probability applied to attn_output_weights. Default: 0.0.

  • proj_drop (float) – Dropout probability applied after MultiheadAttention. Default: 0.0.

  • batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query, key=None, value=None, identity=None, query_pos=None, key_pos=None, query_sine_embed=None, is_first_layer=False, attn_mask=None, key_padding_mask=None, **kwargs)[source]

Forward function for ConditionalCrossAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters
  • query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)

  • key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)

  • value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.

  • identity (torch.Tensor) – A tensor with the same shape as query, used for identity addition. Default: None. If None, query will be used.

  • query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.

  • key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.

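  • query_sine_embed (torch.Tensor) – Sinusoidal position embedding for query. Default: None.

  • is_first_layer (bool) – Whether this module is used in the first decoder layer. Default: False.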

  • attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.

  • key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.

class detrex.layers.ConditionalSelfAttention(embed_dim, num_heads, attn_drop=0.0, proj_drop=0.0, batch_first=False, **kwargs)[source]

Conditional Self-Attention Module used in Conditional-DETR

For details, see the paper "Conditional DETR for Fast Training Convergence".

Parameters
  • embed_dim (int) – The embedding dimension for attention.

  • num_heads (int) – The number of attention heads.

  • attn_drop (float) – Dropout probability applied to attn_output_weights. Default: 0.0.

  • proj_drop (float) – Dropout probability applied after MultiheadAttention. Default: 0.0.

  • batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query, key=None, value=None, identity=None, query_pos=None, key_pos=None, attn_mask=None, key_padding_mask=None, **kwargs)[source]

Forward function for ConditionalSelfAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters
  • query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)

  • key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)

  • value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.

  • identity (torch.Tensor) – A tensor with the same shape as query, used for identity addition. Default: None. If None, query will be used.

  • query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.

  • key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.

  • attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.

  • key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.

class detrex.layers.ConvNormAct(in_channels: int, out_channels: int, kernel_size: int = 1, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = True, norm_layer: Optional[Module] = None, activation: Optional[Module] = None, **kwargs)[source]

Utility module that stacks one convolution 2D layer, a normalization layer and an activation function.

Parameters
  • in_channels (int) – The number of input channels.

  • out_channels (int) – The number of output channels.

  • kernel_size (int) – Size of the convolving kernel. Default: 1.

  • stride (int) – Stride of convolution. Default: 1.

  • padding (int) – Padding added to all four sides of the input. Default: 0.

  • dilation (int) – Spacing between kernel elements. Default: 1.

  • groups (int) – Number of blocked connections from input channels to output channels. Default: 1.

  • bias (bool) – if True, adds a learnable bias to the output. Default: True.

  • norm_layer (nn.Module) – Normalization layer used in ConvNormAct. Default: None.

  • activation (nn.Module) – Activation layer used in ConvNormAct. Default: None.

forward(x)[source]

Forward function for ConvNormAct
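
The spatial output size can be sketched as follows, assuming ConvNormAct wraps a standard torch.nn.Conv2d and that the normalization and activation layers leave the shape unchanged. The helper conv2d_output_size is hypothetical; the formula is the one given in the torch.nn.Conv2d documentation.

```python
def conv2d_output_size(size, kernel_size=1, stride=1, padding=0, dilation=1):
    """Hypothetical helper: output height or width of a 2D convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


same_1x1 = conv2d_output_size(32)                                    # defaults keep the size
same_3x3 = conv2d_output_size(32, kernel_size=3, padding=1)          # "same" padding for 3x3
halved = conv2d_output_size(32, kernel_size=3, stride=2, padding=1)  # stride 2 halves the size
dilated = conv2d_output_size(7, kernel_size=3, dilation=2)           # dilation widens the kernel
```

With the default kernel_size=1, stride=1, and padding=0, the module acts as a pointwise (1x1) convolution and preserves the spatial size.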

class detrex.layers.FFN(embed_dim=256, feedforward_dim=1024, output_dim=None, num_fcs=2, activation=ReLU(inplace=True), ffn_drop=0.0, fc_bias=True, add_identity=True)[source]

The implementation of feed-forward networks (FFNs) with identity connection.

Parameters
  • embed_dim (int) – The feature dimension. Same as MultiheadAttention. Default: 256.

  • feedforward_dim (int) – The hidden dimension of FFNs. Default: 1024.

  • output_dim (int) – The output feature dimension of FFNs. Default: None. If None, the embed_dim will be used.

  • num_fcs (int, optional) – The number of fully-connected layers in FFNs. Default: 2.

  • activation (nn.Module) – The activation layer used in FFNs. Default: nn.ReLU(inplace=True).

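  • ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Default: 0.0.

  • fc_bias (bool, optional) – Whether to add a bias term to the fully-connected layers. Default: True.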

  • add_identity (bool, optional) – Whether to add the identity connection. Default: True.

forward(x, identity=None) -> Tensor [source]

Forward function of FFN.

Parameters
  • x (torch.Tensor) – The input tensor used in FFN layers.

  • identity (torch.Tensor) – A tensor with the same shape as x, used for identity addition. Default: None. If None, x will be used.

Returns

the forward results of FFN layer

Return type

torch.Tensor
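
A dependency-free sketch of the FFN data flow for num_fcs=2 and ffn_drop=0.0, operating on a single feature vector: Linear, ReLU, Linear, then the identity connection. The function names and the toy weights are illustrative assumptions, not detrex code.

```python
def linear(x, weight, bias):
    """y = W x + b for one feature vector; weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def ffn_forward(x, w1, b1, w2, b2, identity=None, add_identity=True):
    """Two fully-connected layers with an optional identity (residual) connection."""
    out = linear(relu(linear(x, w1, b1)), w2, b2)
    if not add_identity:
        return out
    if identity is None:
        identity = x  # fall back to the input, as documented for `identity`
    return [i + o for i, o in zip(identity, out)]


x = [1.0, -2.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # embed_dim=2 -> feedforward_dim=3
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0], [0.0, 2.0, 0.0]]    # feedforward_dim=3 -> embed_dim=2
b2 = [0.5, 0.0]

y = ffn_forward(x, w1, b1, w2, b2)                               # [2.5, -2.0]
y_no_identity = ffn_forward(x, w1, b1, w2, b2, add_identity=False)  # [1.5, 0.0]
```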

class detrex.layers.GenerateDNQueries(num_queries: int = 300, num_classes: int = 80, label_embed_dim: int = 256, denoising_groups: int = 5, label_noise_prob: float = 0.2, box_noise_scale: float = 0.4, with_indicator: bool = False)[source]

Generate denoising queries for DN-DETR

Parameters
  • num_queries (int) – Number of total queries in DN-DETR. Default: 300

  • num_classes (int) – Number of total categories. Default: 80.

  • label_embed_dim (int) – The embedding dimension for label encoding. Default: 256.

  • denoising_groups (int) – Number of noised ground truth groups. Default: 5.

  • label_noise_prob (float) – The probability of the label being noised. Default: 0.2.

  • box_noise_scale (float) – Scaling factor for box noising. Default: 0.4

  • with_indicator (bool) – If True, add an indicator to the noised label/box queries. Default: False.

forward(gt_labels_list, gt_boxes_list)[source]

Parameters
  • gt_boxes_list (list[torch.Tensor]) – Ground truth bounding boxes per image with normalized coordinates in format (x, y, w, h) in shape (num_gts, 4)

  • gt_labels_list (list[torch.Tensor]) – Classification labels per image in shape (num_gts,).

class detrex.layers.LayerNorm(normalized_shape, eps=1e-06, channel_last=True)[source]

LayerNorm which supports both channel_last (default) and channel_first data format. The inputs data format should be as follows:

  • channel_last: (bs, h, w, channels)

  • channel_first: (bs, channels, h, w)

Parameters
  • normalized_shape (tuple) – The size of the input feature dim.

  • eps (float) – A value added to the denominator for numerical stability. Default: 1e-6.

  • channel_last (bool) – Set True for channel_last input data format. Default: True.

forward(x)[source]

Forward function for LayerNorm
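
The two data formats can be illustrated with a plain-Python sketch that normalizes over the channel dimension at every spatial position of one image. It assumes a biased variance estimate and omits the learnable scale and shift; the function names are hypothetical.

```python
import math


def layer_norm_channels(values, eps=1e-6):
    """Normalize one channel vector to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def layer_norm_2d(x, channel_last=True, eps=1e-6):
    """x is one image: (h, w, c) if channel_last, else (c, h, w)."""
    if channel_last:
        return [[layer_norm_channels(px, eps) for px in row] for row in x]
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            # Gather the channel vector at (i, j), normalize, scatter back.
            normed = layer_norm_channels([x[k][i][j] for k in range(c)], eps)
            for k in range(c):
                out[k][i][j] = normed[k]
    return out


out_last = layer_norm_2d([[[1.0, 2.0, 3.0]]], channel_last=True)         # (1, 1, 3)
out_first = layer_norm_2d([[[1.0]], [[2.0]], [[3.0]]], channel_last=False)  # (3, 1, 1)
```

Both calls normalize the same three channel values, so they produce the same numbers in different layouts.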

class detrex.layers.MLP(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int)[source]

The implementation of simple multi-layer perceptron layer without dropout and identity connection.

The feature process order follows Linear -> ReLU -> Linear -> ReLU -> ….

Parameters
  • input_dim (int) – The input feature dimension.

  • hidden_dim (int) – The hidden dimension of MLPs.

  • output_dim (int) – The output feature dimension of MLPs.

  • num_layers (int) – The number of FC layers used in MLPs.

forward(x)[source]

Forward function of MLP.

Parameters

x (torch.Tensor) – the input tensor used in MLP layers.

Returns

the forward results of MLP layer

Return type

torch.Tensor
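
A small sketch of how the four constructor arguments could map to a layer stack. It assumes that the hidden width is reused for every intermediate layer and that no ReLU follows the final Linear layer (as in the DETR reference MLP); mlp_plan is a hypothetical helper, not part of detrex.

```python
def mlp_plan(input_dim, hidden_dim, output_dim, num_layers):
    """Hypothetical helper: list the operations of an MLP as strings."""
    hidden = [hidden_dim] * (num_layers - 1)
    dims = zip([input_dim] + hidden, hidden + [output_dim])
    plan = []
    for i, (n_in, n_out) in enumerate(dims):
        plan.append(f"Linear({n_in}, {n_out})")
        if i < num_layers - 1:  # assumption: no activation after the last layer
            plan.append("ReLU")
    return plan


plan = mlp_plan(input_dim=256, hidden_dim=256, output_dim=4, num_layers=3)
```

For a three-layer box-regression head this yields Linear(256, 256), ReLU, Linear(256, 256), ReLU, Linear(256, 4).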

class detrex.layers.MultiheadAttention(embed_dim: int, num_heads: int, attn_drop: float = 0.0, proj_drop: float = 0.0, batch_first: bool = False, **kwargs)[source]

A wrapper for torch.nn.MultiheadAttention

Implements MultiheadAttention with an identity connection; position embeddings are also passed as inputs.

Parameters
  • embed_dim (int) – The embedding dimension for attention.

  • num_heads (int) – The number of attention heads.

  • attn_drop (float) – Dropout probability applied to attn_output_weights. Default: 0.0.

  • proj_drop (float) – Dropout probability applied after MultiheadAttention. Default: 0.0.

  • batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, identity: Optional[Tensor] = None, query_pos: Optional[Tensor] = None, key_pos: Optional[Tensor] = None, attn_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None, **kwargs) -> Tensor [source]

Forward function for MultiheadAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters
  • query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)

  • key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)

  • value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.

  • identity (torch.Tensor) – A tensor with the same shape as query, used for identity addition. Default: None. If None, query will be used.

  • query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.

  • key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.

  • attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.

  • key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.
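
The documented fallbacks for value, identity, and key_pos can be sketched with stand-in objects that only carry a shape. FakeTensor and resolve_attention_inputs are hypothetical names; the sketch requires key to be given, because the behavior for key=None is not described above.

```python
from collections import namedtuple

# Stand-in for a tensor: only the name and shape matter for this sketch.
FakeTensor = namedtuple("FakeTensor", ["name", "shape"])


def resolve_attention_inputs(query, key, value=None, identity=None,
                             query_pos=None, key_pos=None):
    """Apply the documented defaults before attention is computed."""
    if value is None:
        value = key          # "If None, the key will be used."
    if identity is None:
        identity = query     # "If None, query will be used."
    if key_pos is None and query_pos is not None and query_pos.shape == key.shape:
        key_pos = query_pos  # reuse query_pos only when the shapes match
    return value, identity, key_pos


q = FakeTensor("query", (100, 2, 256))
q_pos = FakeTensor("query_pos", (100, 2, 256))
k_same = FakeTensor("key", (100, 2, 256))   # self-attention style: same shape
k_other = FakeTensor("key", (950, 2, 256))  # cross-attention style: different length

same = resolve_attention_inputs(q, k_same, query_pos=q_pos)
other = resolve_attention_inputs(q, k_other, query_pos=q_pos)
```

In the second call key_pos stays None, because query_pos does not have the shape of key.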

class detrex.layers.PositionEmbeddingLearned(num_pos_feats: int = 256, row_num_embed: int = 50, col_num_embed: int = 50)[source]

Position embedding with learnable embedding weights.

Parameters
  • num_pos_feats (int) – The feature dimension for each position along the x-axis or y-axis. The final returned dimension for each position is twice this value. Default: 256.

  • row_num_embed (int, optional) – The dictionary size of row embeddings. Default: 50.

  • col_num_embed (int, optional) – The dictionary size of column embeddings. Default: 50.

forward(mask)[source]

Forward function for PositionEmbeddingLearned.

Parameters

mask (torch.Tensor) – ByteTensor mask with shape (bs, h, w). Non-zero values mark ignored positions, while zero values mark valid positions of the input tensor.

Returns

Returned position embedding with shape (bs, num_pos_feats * 2, h, w)

Return type

torch.Tensor

class detrex.layers.PositionEmbeddingSine(num_pos_feats: int = 64, temperature: int = 10000, scale: float = 6.283185307179586, eps: float = 1e-06, offset: float = 0.0, normalize: bool = False)[source]

Sinusoidal position embedding used in DETR model.

Please see End-to-End Object Detection with Transformers for more details.

Parameters
  • num_pos_feats (int) – The feature dimension for each position along the x-axis or y-axis. The final returned dimension for each position is twice this value. Default: 64.

  • temperature (int, optional) – The temperature used for scaling the position embedding. Default: 10000.

  • scale (float, optional) – A scale factor that scales the position embedding. The scale will be used only when normalize is True. Default: 2*pi.

  • eps (float, optional) – A value added to the denominator for numerical stability. Default: 1e-6.

  • offset (float) – An offset added to the embedding when normalizing. Default: 0.0.

  • normalize (bool, optional) – Whether to normalize the position embedding. Default: False.

forward(mask: Tensor, **kwargs) -> Tensor [source]

Forward function for PositionEmbeddingSine.

Parameters

mask (torch.Tensor) – ByteTensor mask with shape (bs, h, w). Non-zero values mark ignored positions, while zero values mark valid positions of the input tensor.

Returns

Returned position embedding with shape (bs, num_pos_feats * 2, h, w)

Return type

torch.Tensor
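
A plain-Python sketch of the sinusoidal embedding for a single (h, w) mask, following the formulation of the DETR reference implementation as understood here: cumulative counts of valid pixels give the y and x coordinates, which are optionally normalized and then encoded with interleaved sine and cosine terms. The exact placement of offset and eps is an assumption, and the function names are hypothetical.

```python
import math


def sine_embed_1d(value, num_pos_feats, temperature):
    """Encode one coordinate: sin on even feature indices, cos on odd ones."""
    out = []
    for i in range(num_pos_feats):
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = value / dim_t
        out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return out


def sine_position_embedding(mask, num_pos_feats=64, temperature=10000,
                            scale=2 * math.pi, eps=1e-6, offset=0.0,
                            normalize=False):
    """mask: (h, w) nested list, non-zero = ignored. Returns (2 * num_pos_feats, h, w)."""
    h, w = len(mask), len(mask[0])
    valid = [[0 if m else 1 for m in row] for row in mask]
    # Cumulative counts of valid pixels along the rows (y) and columns (x).
    y_embed = [[sum(valid[k][j] for k in range(i + 1)) for j in range(w)] for i in range(h)]
    x_embed = [[sum(valid[i][k] for k in range(j + 1)) for j in range(w)] for i in range(h)]
    pos = [[[0.0] * w for _ in range(h)] for _ in range(2 * num_pos_feats)]
    for i in range(h):
        for j in range(w):
            y, x = float(y_embed[i][j]), float(x_embed[i][j])
            if normalize:  # assumed placement of offset and eps
                y = (y + offset) / (y_embed[h - 1][j] + eps) * scale
                x = (x + offset) / (x_embed[i][w - 1] + eps) * scale
            feats = (sine_embed_1d(y, num_pos_feats, temperature)
                     + sine_embed_1d(x, num_pos_feats, temperature))
            for c, v in enumerate(feats):
                pos[c][i][j] = v
    return pos


mask = [[0, 0, 1],
        [0, 0, 1]]  # last column is padding
pos = sine_position_embedding(mask, num_pos_feats=4)
```

The y features occupy the first num_pos_feats channels and the x features the remaining ones, which gives the documented channel count of num_pos_feats * 2.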

class detrex.layers.TransformerLayerSequence(transformer_layers=None, num_layers=None)[source]

Base class for TransformerEncoder and TransformerDecoder. It either copies the passed transformer_layers module num_layers times or stores the passed list of transformer_layers, and keeps the result in self.layers, which is an nn.ModuleList. Users should inherit from TransformerLayerSequence and implement their own forward function.

Parameters
  • transformer_layers (list[BaseTransformerLayer] | BaseTransformerLayer) – A list of BaseTransformerLayer. If a single BaseTransformerLayer is passed, it is repeated num_layers times to form a list[BaseTransformerLayer].

  • num_layers (int) – The number of TransformerLayer. Default: None.

forward()[source]

Forward function of TransformerLayerSequence. Users should inherit from TransformerLayerSequence and implement their own forward function.
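
The subclassing pattern can be sketched without PyTorch. The classes below are stand-ins: LayerSequenceSketch mimics the described behavior with a plain list instead of nn.ModuleList, and it assumes that a single layer is replicated with copy.deepcopy so that the copies do not share state. ToyLayer and ToyEncoder are hypothetical.

```python
import copy


class LayerSequenceSketch:
    """Stand-in for TransformerLayerSequence using a plain Python list."""

    def __init__(self, transformer_layers=None, num_layers=None):
        if isinstance(transformer_layers, (list, tuple)):
            self.layers = list(transformer_layers)  # store the list as given
        else:
            # Assumption: independent deep copies of the single layer.
            self.layers = [copy.deepcopy(transformer_layers) for _ in range(num_layers)]
        self.num_layers = len(self.layers)

    def forward(self, *args, **kwargs):
        raise NotImplementedError("subclasses implement their own forward")


class ToyLayer:
    """Toy layer that adds one to its input."""

    def __call__(self, x):
        return x + 1


class ToyEncoder(LayerSequenceSketch):
    def forward(self, x):
        for layer in self.layers:  # apply the layers in order
            x = layer(x)
        return x


encoder = ToyEncoder(transformer_layers=ToyLayer(), num_layers=3)
result = encoder.forward(0)
```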

detrex.layers.apply_box_noise(boxes: Tensor, box_noise_scale: float = 0.4)[source]

Parameters
  • boxes (torch.Tensor) – Bounding boxes in format (x_c, y_c, w, h) with shape (num_boxes, 4)

  • box_noise_scale (float) – Scaling factor for box noising. Default: 0.4.
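
The documentation above does not state the noise formula. The sketch below follows a DN-DETR-style recipe as an assumption: the center may shift by up to half the box size and the width and height by up to the full size, each scaled by box_noise_scale, and the result is clamped to [0, 1]. The function name and the rng argument are hypothetical.

```python
import random


def apply_box_noise_sketch(boxes, box_noise_scale=0.4, rng=None):
    """boxes: list of normalized (x_c, y_c, w, h) tuples."""
    rng = rng or random.Random()
    noised = []
    for cx, cy, w, h in boxes:
        # Maximum shift per coordinate (assumed recipe, see lead-in).
        diff = (w / 2, h / 2, w, h)
        box = []
        for value, d in zip((cx, cy, w, h), diff):
            value += rng.uniform(-1.0, 1.0) * d * box_noise_scale
            box.append(min(max(value, 0.0), 1.0))  # clamp to [0, 1]
        noised.append(tuple(box))
    return noised


boxes = [(0.5, 0.5, 0.2, 0.4)]
noised = apply_box_noise_sketch(boxes, box_noise_scale=0.4, rng=random.Random(0))
unchanged = apply_box_noise_sketch(boxes, box_noise_scale=0.0)
```

Under this recipe the center of the example box moves by at most 0.04 horizontally and 0.08 vertically.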

detrex.layers.apply_label_noise(labels: Tensor, label_noise_prob: float = 0.2, num_classes: int = 80)[source]

Parameters
  • labels (torch.Tensor) – Classification labels with shape (num_labels,).

  • label_noise_prob (float) – The probability of the label being noised. Default: 0.2.

  • num_classes (int) – Number of total categories. Default: 80.

Returns

The noised labels, with the same shape as labels.

Return type

torch.Tensor
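
A plain-Python sketch of label noising, assuming each label is independently replaced, with probability label_noise_prob, by a class drawn uniformly from [0, num_classes). The function name and the rng argument are hypothetical.

```python
import random


def apply_label_noise_sketch(labels, label_noise_prob=0.2, num_classes=80, rng=None):
    """labels: list of int class indices. Returns a new list of the same length."""
    rng = rng or random.Random()
    noised = []
    for label in labels:
        if rng.random() < label_noise_prob:
            # The random draw may happen to return the original class.
            noised.append(rng.randrange(num_classes))
        else:
            noised.append(label)
    return noised


labels = [3, 7, 7, 42]
noised = apply_label_noise_sketch(labels, rng=random.Random(0))
untouched = apply_label_noise_sketch(labels, label_noise_prob=0.0)
```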

detrex.layers.box_cxcywh_to_xyxy(bbox) -> Tensor [source]

Convert bbox coordinates from (cx, cy, w, h) to (x1, y1, x2, y2)

Parameters

bbox (torch.Tensor) – Shape (n, 4) for bboxes.

Returns

Converted bboxes.

Return type

torch.Tensor
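
The conversion arithmetic, sketched on plain tuples rather than tensors (the function name is hypothetical):

```python
def box_cxcywh_to_xyxy_sketch(boxes):
    """Convert (cx, cy, w, h) boxes to (x1, y1, x2, y2)."""
    return [
        (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)
        for cx, cy, w, h in boxes
    ]


corners = box_cxcywh_to_xyxy_sketch([(0.5, 0.5, 0.5, 0.25)])
# corners == [(0.25, 0.375, 0.75, 0.625)]
```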

detrex.layers.box_iou(boxes1, boxes2) -> Tuple[Tensor] [source]

Modified from torchvision.ops.box_iou

Return both intersection-over-union (Jaccard index) and union between two sets of boxes.

Parameters
  • boxes1 (torch.Tensor[N, 4]) – First set of boxes.

  • boxes2 (torch.Tensor[M, 4]) – Second set of boxes.

Returns

A tuple of two N x M matrices, (torch.Tensor[N, M], torch.Tensor[N, M]), containing the pairwise IoU and union values for every pair of boxes from boxes1 and boxes2.

Return type

Tuple
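
A plain-Python sketch of the pairwise computation, assuming boxes in (x1, y1, x2, y2) format as in torchvision.ops.box_iou and non-degenerate boxes (a zero union would divide by zero). The function names are hypothetical.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def box_iou_sketch(boxes1, boxes2):
    """Return (iou, union), each an N x M nested list."""
    iou, union = [], []
    for a in boxes1:
        iou_row, union_row = [], []
        for b in boxes2:
            # Width and height of the overlap, clamped at zero when disjoint.
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = iw * ih
            u = box_area(a) + box_area(b) - inter
            iou_row.append(inter / u)
            union_row.append(u)
        iou.append(iou_row)
        union.append(union_row)
    return iou, union


iou, union = box_iou_sketch(
    [(0.0, 0.0, 2.0, 2.0)],
    [(1.0, 1.0, 3.0, 3.0), (2.0, 2.0, 4.0, 4.0)],
)
```

The first pair overlaps in a unit square (IoU 1/7, union 7); the second pair only touches at a corner (IoU 0, union 8).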

detrex.layers.box_xyxy_to_cxcywh(bbox) -> Tensor [source]

Convert bbox coordinates from (x1, y1, x2, y2) to (cx, cy, w, h)

Parameters

bbox (torch.Tensor) – Shape (n, 4) for bboxes.

Returns

Converted bboxes.

Return type

torch.Tensor
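
The conversion arithmetic, sketched on plain tuples rather than tensors (the function name is hypothetical):

```python
def box_xyxy_to_cxcywh_sketch(boxes):
    """Convert (x1, y1, x2, y2) boxes to (cx, cy, w, h)."""
    return [
        ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
        for x1, y1, x2, y2 in boxes
    ]


centers = box_xyxy_to_cxcywh_sketch([(0.25, 0.375, 0.75, 0.625)])
# centers == [(0.5, 0.5, 0.5, 0.25)]
```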

detrex.layers.generalized_box_iou(boxes1, boxes2) -> Tensor [source]

Generalized IoU from https://giou.stanford.edu/

The input boxes should be in (x0, y0, x1, y1) format

Parameters
  • boxes1 (torch.Tensor[N, 4]) – First set of boxes.

  • boxes2 (torch.Tensor[M, 4]) – Second set of boxes.

Returns

An N x M matrix containing the pairwise Generalized IoU for every pair of boxes from boxes1 and boxes2.

Return type

torch.Tensor
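
A plain-Python sketch of the standard Generalized IoU definition, GIoU = IoU - (C - U) / C, where U is the union area and C is the area of the smallest axis-aligned box enclosing both inputs. It assumes non-degenerate boxes; the function name is hypothetical.

```python
def generalized_box_iou_sketch(boxes1, boxes2):
    """Return an N x M nested list of GIoU values for (x0, y0, x1, y1) boxes."""

    def area(b):
        return (b[2] - b[0]) * (b[3] - b[1])

    result = []
    for a in boxes1:
        row = []
        for b in boxes2:
            iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = iw * ih
            union = area(a) + area(b) - inter
            # Smallest axis-aligned box enclosing both a and b.
            cw = max(a[2], b[2]) - min(a[0], b[0])
            ch = max(a[3], b[3]) - min(a[1], b[1])
            enclose = cw * ch
            row.append(inter / union - (enclose - union) / enclose)
        result.append(row)
    return result


giou = generalized_box_iou_sketch(
    [(0.0, 0.0, 2.0, 2.0)],
    [(1.0, 1.0, 3.0, 3.0), (2.0, 2.0, 4.0, 4.0), (0.0, 0.0, 2.0, 2.0)],
)
```

Unlike plain IoU, the value stays informative for disjoint boxes: the corner-touching pair scores -0.5 rather than 0, and identical boxes score 1.0.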

detrex.layers.get_sine_pos_embed(pos_tensor: Tensor, num_pos_feats: int = 128, temperature: int = 10000, exchange_xy: bool = True) -> Tensor [source]

Generate a sine position embedding from a position tensor.

Parameters
  • pos_tensor (torch.Tensor) – Shape as (None, n).

  • num_pos_feats (int) – The projected dimension for each float in the tensor. Default: 128.

  • temperature (int) – The temperature used for scaling the position embedding. Default: 10000.

  • exchange_xy (bool, optional) – Whether to exchange pos x and pos y. For example, if the input tensor is [x, y], the result will be [pos(y), pos(x)]. Default: True.

Returns

Returned position embedding with shape (None, n * num_pos_feats).

Return type

torch.Tensor
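
A plain-Python sketch for a single row of n position values. It assumes each value is multiplied by 2*pi before encoding and that sine and cosine terms are interleaved; both details are assumptions about the implementation rather than facts stated above, and the function name is hypothetical.

```python
import math


def get_sine_pos_embed_sketch(pos, num_pos_feats=128, temperature=10000,
                              exchange_xy=True):
    """pos: list of n floats. Returns a flat list of n * num_pos_feats floats."""
    scale = 2 * math.pi  # assumption: positions are scaled by 2*pi

    def sine_func(value):
        out = []
        for i in range(num_pos_feats):
            dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
            angle = value * scale / dim_t
            out.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return out

    blocks = [sine_func(v) for v in pos]
    if exchange_xy and len(blocks) >= 2:
        blocks[0], blocks[1] = blocks[1], blocks[0]  # [x, y] -> [pos(y), pos(x)]
    return [v for block in blocks for v in block]


swapped = get_sine_pos_embed_sketch([0.25, 0.0], num_pos_feats=4)
in_order = get_sine_pos_embed_sketch([0.25, 0.0], num_pos_feats=4, exchange_xy=False)
```

With exchange_xy=True the first num_pos_feats values encode y, so for y = 0.0 they are the sine and cosine of zero.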

detrex.layers.masks_to_boxes(masks) -> Tensor [source]

Compute the bounding boxes around the provided masks

The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.

Returns

a [N, 4] tensor with the boxes in (x0, y0, x1, y1) format.

Return type

torch.Tensor
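
A plain-Python sketch on nested lists, assuming the box corners are the minimum and maximum pixel indices of the non-zero entries (both inclusive) and that an empty mask maps to an all-zero box. Both conventions are assumptions, and the function name is hypothetical.

```python
def masks_to_boxes_sketch(masks):
    """masks: list of N binary masks, each an H x W nested list."""
    boxes = []
    for mask in masks:
        ys = [i for i, row in enumerate(mask) if any(row)]
        xs = [j for row in mask for j, v in enumerate(row) if v]
        if not ys:
            boxes.append((0.0, 0.0, 0.0, 0.0))  # assumed convention for empty masks
            continue
        boxes.append((float(min(xs)), float(min(ys)), float(max(xs)), float(max(ys))))
    return boxes


masks = [
    [[0, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 0]],
    [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]],
]
boxes = masks_to_boxes_sketch(masks)
```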