detrex.layers

class detrex.layers.BaseTransformerLayer(attn: List[Module], ffn: Module, norm: Module, operation_order: Optional[tuple] = None)[source]

The implementation of Base TransformerLayer used in Transformer. Modified from mmcv.

It can be built by directly passing the attention, FFN, and norm modules, which allows more flexible customization in combination with the LazyConfig system. BaseTransformerLayer also supports prenorm when norm is specified as the first element of operation_order. For more details about prenorm, see On Layer Normalization in the Transformer Architecture.

Parameters

attn (list[nn.Module] | nn.Module) – An nn.Module, or a list of nn.Module, containing the attention module(s) used in TransformerLayer.
ffn (nn.Module) – FFN module used in TransformerLayer.
norm (nn.Module) – Normalization layer used in TransformerLayer.
operation_order (tuple[str]) – The execution order of operations in the transformer layer, such as (‘self_attn’, ‘norm’, ‘ffn’, ‘norm’). Prenorm is used when the first element is ‘norm’. Default: None.
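As an illustration of how an operation_order tuple can be read, here is a minimal sketch in plain Python. It is not the detrex implementation: summarize_operation_order is a hypothetical helper, and the operation name 'cross_attn' is an assumption (only 'self_attn', 'norm' and 'ffn' appear in the description above).

```python
from collections import Counter


def summarize_operation_order(operation_order):
    """Count operations and detect prenorm for an operation_order tuple.

    Hypothetical helper for illustration; 'cross_attn' is an assumed name.
    """
    allowed = {"self_attn", "cross_attn", "norm", "ffn"}
    unknown = set(operation_order) - allowed
    if unknown:
        raise ValueError(f"unsupported operations: {sorted(unknown)}")
    counts = Counter(operation_order)
    return {
        "num_attn": counts["self_attn"] + counts["cross_attn"],
        "num_ffn": counts["ffn"],
        "num_norm": counts["norm"],
        # Prenorm is selected by putting 'norm' first.
        "pre_norm": operation_order[0] == "norm",
    }


post_norm = summarize_operation_order(("self_attn", "norm", "ffn", "norm"))
pre_norm = summarize_operation_order(("norm", "self_attn", "norm", "ffn"))
```

In this sketch, num_attn corresponds to the number of attention operations, which is the length expected of attn_masks in forward.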

forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, query_pos: Optional[Tensor] = None, key_pos: Optional[Tensor] = None, attn_masks: Optional[List[Tensor]] = None, query_key_padding_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None, **kwargs)[source]

Forward function for BaseTransformerLayer.

**kwargs contains the specific arguments of attentions.

Parameters

query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) or (bs, num_query, embed_dim), depending on the layout expected by the attention module used in BaseTransformerLayer.
key (torch.Tensor) – Key embeddings used in Attention.
value (torch.Tensor) – Value embeddings with the same shape as key.
query_pos (torch.Tensor) – The position embedding for query. Default: None.
key_pos (torch.Tensor) – The position embedding for key. Default: None.
attn_masks (List[Tensor] | None) – A list of 2D ByteTensor masks, one for each attention operation. The length of attn_masks should equal the number of attention operations in operation_order. Default: None.
query_key_padding_mask (torch.Tensor) – ByteTensor for query, with shape (bs, num_query). Only used in the self_attn layer. Default: None.
key_padding_mask (torch.Tensor) – ByteTensor for key, with shape (bs, num_key). Default: None.

class detrex.layers.ConditionalCrossAttention(embed_dim, num_heads, attn_drop=0.0, proj_drop=0.0, batch_first=False, **kwargs)[source]

Conditional Cross-Attention Module used in Conditional-DETR

Conditional DETR for Fast Training Convergence.

Parameters

embed_dim (int) – The embedding dimension for attention.
num_heads (int) – The number of attention heads.
attn_drop (float) – A Dropout layer on attn_output_weights. Default: 0.0.
proj_drop (float) – A Dropout layer after MultiheadAttention. Default: 0.0.
batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query, key=None, value=None, identity=None, query_pos=None, key_pos=None, query_sine_embed=None, is_first_layer=False, attn_mask=None, key_padding_mask=None, **kwargs)[source]

Forward function for ConditionalCrossAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters

query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)
key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)
value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.
identity (torch.Tensor) – The tensor, with the same shape as query, that will be used for identity addition. Default: None. If None, query will be used.
query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.
key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.
query_sine_embed (torch.Tensor) – None
is_first_layer (bool) – None
attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.
key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.

class detrex.layers.ConditionalSelfAttention(embed_dim, num_heads, attn_drop=0.0, proj_drop=0.0, batch_first=False, **kwargs)[source]

Conditional Self-Attention Module used in Conditional-DETR

Conditional DETR for Fast Training Convergence.

Parameters

embed_dim (int) – The embedding dimension for attention.
num_heads (int) – The number of attention heads.
attn_drop (float) – A Dropout layer on attn_output_weights. Default: 0.0.
proj_drop (float) – A Dropout layer after MultiheadAttention. Default: 0.0.
batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query, key=None, value=None, identity=None, query_pos=None, key_pos=None, attn_mask=None, key_padding_mask=None, **kwargs)[source]

Forward function for ConditionalSelfAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters

query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)
key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)
value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.
identity (torch.Tensor) – The tensor, with the same shape as query, that will be used for identity addition. Default: None. If None, query will be used.
query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.
key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.
attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.
key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.

class detrex.layers.ConvNormAct(in_channels: int, out_channels: int, kernel_size: int = 1, stride: int = 1, padding: int = 0, dilation: int = 1, groups: int = 1, bias: bool = True, norm_layer: Optional[Module] = None, activation: Optional[Module] = None, **kwargs)[source]

Utility module that stacks one convolution 2D layer, a normalization layer and an activation function.

Parameters

in_channels (int) – The number of input channels.
out_channels (int) – The number of output channels.
kernel_size (int) – Size of the convolving kernel. Default: 1.
stride (int) – Stride of convolution. Default: 1.
padding (int) – Padding added to all four sides of the input. Default: 0.
dilation (int) – Spacing between kernel elements. Default: 1.
groups (int) – Number of blocked connections from input channels to output channels. Default: 1.
bias (bool) – if True, adds a learnable bias to the output. Default: True.
norm_layer (nn.Module) – Normalization layer used in ConvNormAct. Default: None.
activation (nn.Module) – Activation layer used in ConvNormAct. Default: None.

forward(x)[source]: Forward function for ConvNormAct

class detrex.layers.FFN(embed_dim=256, feedforward_dim=1024, output_dim=None, num_fcs=2, activation=ReLU(inplace=True), ffn_drop=0.0, fc_bias=True, add_identity=True)[source]

The implementation of feed-forward networks (FFNs) with identity connection.

Parameters

embed_dim (int) – The feature dimension. Same as MultiheadAttention. Default: 256.
feedforward_dim (int) – The hidden dimension of FFNs. Default: 1024.
output_dim (int) – The output feature dimension of FFNs. Default: None. If None, the embed_dim will be used.
num_fcs (int, optional) – The number of fully-connected layers in FFNs. Default: 2.
activation (nn.Module) – The activation layer used in FFNs. Default: nn.ReLU(inplace=True).
ffn_drop (float, optional) – Probability of an element to be zeroed in FFN. Default: 0.0.
add_identity (bool, optional) – Whether to add the identity connection. Default: True.

forward(x, identity=None) → Tensor[source]

Forward function of FFN.

Parameters

x (torch.Tensor) – The input tensor used in FFN layers.
identity (torch.Tensor) – The tensor, with the same shape as x, that will be used for identity addition. Default: None. If None, x will be used.

Returns

the forward results of FFN layer

Return type

torch.Tensor

class detrex.layers.GenerateDNQueries(num_queries: int = 300, num_classes: int = 80, label_embed_dim: int = 256, denoising_groups: int = 5, label_noise_prob: float = 0.2, box_noise_scale: float = 0.4, with_indicator: bool = False)[source]

Generate denoising queries for DN-DETR

Parameters

num_queries (int) – Number of total queries in DN-DETR. Default: 300
num_classes (int) – Number of total categories. Default: 80.
label_embed_dim (int) – The embedding dimension for label encoding. Default: 256.
denoising_groups (int) – Number of noised ground truth groups. Default: 5.
label_noise_prob (float) – The probability of the label being noised. Default: 0.2.
box_noise_scale (float) – Scaling factor for box noising. Default: 0.4
with_indicator (bool) – If True, add an indicator to the noised label/box queries. Default: False.

forward(gt_labels_list, gt_boxes_list)[source]

Parameters

gt_boxes_list (list[torch.Tensor]) – Ground truth bounding boxes per image with normalized coordinates in format (x, y, w, h) in shape (num_gts, 4)
gt_labels_list (list[torch.Tensor]) – Classification labels per image in shape (num_gt, ).

class detrex.layers.LayerNorm(normalized_shape, eps=1e-06, channel_last=True)[source]

LayerNorm which supports both channel_last (default) and channel_first data formats. The input data format should be as follows:

channel_last: (bs, h, w, channels)

channel_first: (bs, channels, h, w)

Parameters

normalized_shape (tuple) – The size of the input feature dim.
eps (float) – A value added to the denominator for numerical stability. Default: 1e-6.
channel_last (bool) – Set True for channel_last input data format. Default: True.

forward(x)[source]: Forward function for LayerNorm

class detrex.layers.MLP(input_dim: int, hidden_dim: int, output_dim: int, num_layers: int)[source]

The implementation of simple multi-layer perceptron layer without dropout and identity connection.

The feature process order follows Linear -> ReLU -> Linear -> ReLU -> ….

Parameters

input_dim (int) – The input feature dimension.
hidden_dim (int) – The hidden dimension of MLPs.
output_dim (int) – The output feature dimension of MLPs.
num_layers (int) – The number of FC layers used in MLPs.

forward(x)[source]

Forward function of MLP.

Parameters: x (torch.Tensor) – the input tensor used in MLP layers.
Returns: the forward results of MLP layer
Return type: torch.Tensor
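To make the Linear -> ReLU -> Linear ordering concrete, the following sketch lists the linear layers such an MLP would contain. It is plain Python with a hypothetical helper, mlp_layer_plan, and it assumes the DETR-style layout in which every layer except the last maps into hidden_dim and is followed by ReLU; this layout is an assumption, not taken from the detrex source.

```python
def mlp_layer_plan(input_dim, hidden_dim, output_dim, num_layers):
    """Return (in_features, out_features, followed_by_relu) per linear layer.

    Assumes the DETR-style layout: every layer but the last maps into
    hidden_dim and is followed by ReLU; the last maps to output_dim.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    hidden = [hidden_dim] * (num_layers - 1)
    ins = [input_dim] + hidden
    outs = hidden + [output_dim]
    return [
        (n_in, n_out, idx < num_layers - 1)
        for idx, (n_in, n_out) in enumerate(zip(ins, outs))
    ]


# A 3-layer head mapping 256-dim features to 4 box coordinates.
plan = mlp_layer_plan(input_dim=256, hidden_dim=256, output_dim=4, num_layers=3)
```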

class detrex.layers.MultiheadAttention(embed_dim: int, num_heads: int, attn_drop: float = 0.0, proj_drop: float = 0.0, batch_first: bool = False, **kwargs)[source]

A wrapper for torch.nn.MultiheadAttention

Implements MultiheadAttention with an identity connection; position embeddings are also passed as input.

Parameters

embed_dim (int) – The embedding dimension for attention.
num_heads (int) – The number of attention heads.
attn_drop (float) – A Dropout layer on attn_output_weights. Default: 0.0.
proj_drop (float) – A Dropout layer after MultiheadAttention. Default: 0.0.
batch_first (bool) – If True, the input and output tensors are provided as (bs, n, embed_dim); otherwise as (n, bs, embed_dim). Default: False.

forward(query: Tensor, key: Optional[Tensor] = None, value: Optional[Tensor] = None, identity: Optional[Tensor] = None, query_pos: Optional[Tensor] = None, key_pos: Optional[Tensor] = None, attn_mask: Optional[Tensor] = None, key_padding_mask: Optional[Tensor] = None, **kwargs) → Tensor[source]

Forward function for MultiheadAttention

**kwargs allows a more general data flow to be passed when this module is combined with other operations in a transformer layer.

Parameters

query (torch.Tensor) – Query embeddings with shape (num_query, bs, embed_dim) if self.batch_first is False, else (bs, num_query, embed_dim)
key (torch.Tensor) – Key embeddings with shape (num_key, bs, embed_dim) if self.batch_first is False, else (bs, num_key, embed_dim)
value (torch.Tensor) – Value embeddings with the same shape as key. Same as in torch.nn.MultiheadAttention.forward. Default: None. If None, the key will be used.
identity (torch.Tensor) – The tensor, with the same shape as query, that will be used for identity addition. Default: None. If None, query will be used.
query_pos (torch.Tensor) – The position embedding for query, with the same shape as query. Default: None.
key_pos (torch.Tensor) – The position embedding for key. Default: None. If None, and query_pos has the same shape as key, then query_pos will be used for key_pos.
attn_mask (torch.Tensor) – ByteTensor mask with shape (num_query, num_key). Same as torch.nn.MultiheadAttention.forward. Default: None.
key_padding_mask (torch.Tensor) – ByteTensor with shape (bs, num_key) indicating which elements of key are ignored in attention. Default: None.

class detrex.layers.PositionEmbeddingLearned(num_pos_feats: int = 256, row_num_embed: int = 50, col_num_embed: int = 50)[source]

Position embedding with learnable embedding weights.

Parameters

num_pos_feats (int) – The feature dimension for each position along the x-axis or y-axis. The final returned dimension for each position is twice this value. Default: 256.
row_num_embed (int, optional) – The dictionary size of row embeddings. Default: 50.
col_num_embed (int, optional) – The dictionary size of column embeddings. Default: 50.

forward(mask)[source]

Forward function for PositionEmbeddingLearned.

Parameters: mask (torch.Tensor) – ByteTensor mask with shape (bs, h, w). Non-zero values represent ignored positions, while zero values represent valid positions of the input tensor.
Returns: Returned position embedding with shape (bs, num_pos_feats * 2, h, w)
Return type: torch.Tensor

class detrex.layers.PositionEmbeddingSine(num_pos_feats: int = 64, temperature: int = 10000, scale: float = 6.283185307179586, eps: float = 1e-06, offset: float = 0.0, normalize: bool = False)[source]

Sinusoidal position embedding used in DETR model.

Please see End-to-End Object Detection with Transformers for more details.

Parameters

num_pos_feats (int) – The feature dimension for each position along the x-axis or y-axis. The final returned dimension for each position is twice this value. Default: 64.
temperature (int, optional) – The temperature used for scaling the position embedding. Default: 10000.
scale (float, optional) – A scale factor that scales the position embedding. The scale will be used only when normalize is True. Default: 2*pi.
eps (float, optional) – A value added to the denominator for numerical stability. Default: 1e-6.
offset (float) – An offset added to embed when doing normalization. Default: 0.0.
normalize (bool, optional) – Whether to normalize the position embedding. Default: False.

forward(mask: Tensor, **kwargs) → Tensor[source]

Forward function for PositionEmbeddingSine.

Parameters: mask (torch.Tensor) – ByteTensor mask with shape (bs, h, w). Non-zero values represent ignored positions, while zero values represent valid positions of the input tensor.
Returns: Returned position embedding with shape (bs, num_pos_feats * 2, h, w)
Return type: torch.Tensor

class detrex.layers.TransformerLayerSequence(transformer_layers=None, num_layers=None)[source]

Base class for TransformerEncoder and TransformerDecoder. It either copies the passed transformer_layers module num_layers times or stores the passed list of transformer_layers; in both cases the result is kept as self.layers, which is an nn.ModuleList. Users should inherit from TransformerLayerSequence and implement their own forward function.

Parameters

transformer_layers (list[BaseTransformerLayer] | BaseTransformerLayer) – A list of BaseTransformerLayer, or a single BaseTransformerLayer. A single layer is repeated num_layers times to form a list[BaseTransformerLayer].
num_layers (int) – The number of TransformerLayer. Default: None.

forward()[source]: Forward function of TransformerLayerSequence. The users should inherit TransformerLayerSequence and implemente their own forward function.

detrex.layers.apply_box_noise(boxes: Tensor, box_noise_scale: float = 0.4)[source]

Parameters

boxes (torch.Tensor) – Bounding boxes in format (x_c, y_c, w, h) with shape (num_boxes, 4)
box_noise_scale (float) – Scaling factor for box noising. Default: 0.4.
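The reference above does not describe the noising scheme itself. The sketch below shows one plausible DN-DETR-style scheme in plain Python, under stated assumptions: centers are shifted by at most box_noise_scale * (w/2, h/2), sizes are changed by at most box_noise_scale * (w, h), and the results are clamped to [0, 1]. The function name apply_box_noise_sketch and the rng argument are hypothetical and are not part of detrex.

```python
import random


def apply_box_noise_sketch(boxes, box_noise_scale=0.4, rng=None):
    """Jitter (cx, cy, w, h) boxes with normalized coordinates.

    Assumed scheme (DN-DETR style): centers move by at most
    box_noise_scale * (w/2, h/2), sizes change by at most
    box_noise_scale * (w, h), and results are clamped to [0, 1].
    """
    rng = rng or random.Random()
    if box_noise_scale <= 0:
        return [list(box) for box in boxes]
    noised = []
    for cx, cy, w, h in boxes:
        max_shift = (w / 2, h / 2, w, h)
        new_box = []
        for value, limit in zip((cx, cy, w, h), max_shift):
            delta = rng.uniform(-1.0, 1.0) * limit * box_noise_scale
            new_box.append(min(max(value + delta, 0.0), 1.0))
        noised.append(new_box)
    return noised


boxes = [[0.5, 0.5, 0.2, 0.4]]
noised = apply_box_noise_sketch(boxes, box_noise_scale=0.4, rng=random.Random(0))
```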

detrex.layers.apply_label_noise(labels: Tensor, label_noise_prob: float = 0.2, num_classes: int = 80)[source]

Parameters

labels (torch.Tensor) – Classification labels with shape (num_labels, ).
label_noise_prob (float) – The probability of the label being noised. Default: 0.2.
num_classes (int) – Number of total categories. Default: 80.

Returns

The noised labels, with the same shape as labels.

Return type

torch.Tensor
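A minimal sketch of label noising in plain Python, assuming that each label is independently replaced, with probability label_noise_prob, by a class drawn uniformly from range(num_classes). The name apply_label_noise_sketch and the rng argument are hypothetical, and the sketch works on Python lists, not tensors.

```python
import random


def apply_label_noise_sketch(labels, label_noise_prob=0.2, num_classes=80, rng=None):
    """Replace each label, with probability label_noise_prob, by a random class.

    Illustrative only; the replacement class is drawn uniformly from
    range(num_classes) and may coincide with the original label.
    """
    rng = rng or random.Random()
    noised = []
    for label in labels:
        if rng.random() < label_noise_prob:
            noised.append(rng.randrange(num_classes))
        else:
            noised.append(label)
    return noised


labels = [3, 17, 42, 42, 7]
noised = apply_label_noise_sketch(labels, label_noise_prob=0.5, rng=random.Random(0))
unchanged = apply_label_noise_sketch(labels, label_noise_prob=0.0)
```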

detrex.layers.box_cxcywh_to_xyxy(bbox) → Tensor[source]

Convert bbox coordinates from (cx, cy, w, h) to (x1, y1, x2, y2)

Parameters: bbox (torch.Tensor) – Shape (n, 4) for bboxes.
Returns: Converted bboxes.
Return type: torch.Tensor
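The conversion for a single box can be written out as below. This is a plain-Python sketch on tuples, not the tensor implementation, and cxcywh_to_xyxy is a hypothetical name.

```python
def cxcywh_to_xyxy(box):
    """Convert one (cx, cy, w, h) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    # Corners lie half a width/height away from the center.
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


corners = cxcywh_to_xyxy((0.5, 0.5, 0.25, 0.5))
```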

detrex.layers.box_iou(boxes1, boxes2) → Tuple[Tensor][source]

Modified from torchvision.ops.box_iou

Return both intersection-over-union (Jaccard index) and union between two sets of boxes.

Parameters

boxes1 – (torch.Tensor[N, 4]): first set of boxes
boxes2 – (torch.Tensor[M, 4]): second set of boxes

Returns

A tuple of two N x M matrices, (torch.Tensor[N, M], torch.Tensor[N, M]), containing the pairwise IoU and union values for every pair of boxes in boxes1 and boxes2.

Return type

Tuple
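The pairwise computation can be sketched in plain Python as follows. It assumes boxes in (x1, y1, x2, y2) format with positive area, as in torchvision.ops.box_iou; pairwise_iou_and_union and box_area are hypothetical names, and nested lists stand in for tensors.

```python
def box_area(box):
    """Area of one (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def pairwise_iou_and_union(boxes1, boxes2):
    """Return (iou, union) as N x M nested lists for (x1, y1, x2, y2) boxes."""
    iou, union = [], []
    for a in boxes1:
        iou_row, union_row = [], []
        for b in boxes2:
            # Overlap extent along each axis, zero if the boxes are disjoint.
            inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
            inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            inter = inter_w * inter_h
            u = box_area(a) + box_area(b) - inter
            iou_row.append(inter / u)
            union_row.append(u)
        iou.append(iou_row)
        union.append(union_row)
    return iou, union


boxes1 = [(0.0, 0.0, 2.0, 2.0)]
boxes2 = [(1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 2.0, 2.0)]
iou, union = pairwise_iou_and_union(boxes1, boxes2)
```

Returning the union alongside the IoU avoids recomputing it when a later step, such as Generalized IoU, needs both.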

detrex.layers.box_xyxy_to_cxcywh(bbox) → Tensor[source]

Convert bbox coordinates from (x1, y1, x2, y2) to (cx, cy, w, h)

Parameters: bbox (torch.Tensor) – Shape (n, 4) for bboxes.
Returns: Converted bboxes.
Return type: torch.Tensor
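For a single box, this conversion amounts to taking the midpoint and the extent along each axis. The sketch below uses plain tuples and a hypothetical name, xyxy_to_cxcywh.

```python
def xyxy_to_cxcywh(box):
    """Convert one (x1, y1, x2, y2) box to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    # Center is the midpoint of the corners; size is their difference.
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


center = xyxy_to_cxcywh((0.375, 0.25, 0.625, 0.75))
```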

detrex.layers.generalized_box_iou(boxes1, boxes2) → Tensor[source]

Generalized IoU from https://giou.stanford.edu/

The input boxes should be in (x0, y0, x1, y1) format

Parameters

boxes1 – (torch.Tensor[N, 4]): first set of boxes
boxes2 – (torch.Tensor[M, 4]): second set of boxes

Returns

An N x M matrix containing the pairwise Generalized IoU for every pair of boxes in boxes1 and boxes2.

Return type

torch.Tensor
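Generalized IoU subtracts from the IoU the fraction of the smallest enclosing box that is not covered by the union. A plain-Python sketch for boxes with positive area, using hypothetical function names and nested lists in place of tensors:

```python
def generalized_iou(a, b):
    """GIoU of two (x0, y0, x1, y1) boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def pairwise_generalized_iou(boxes1, boxes2):
    """Return an N x M nested list of GIoU values."""
    return [[generalized_iou(a, b) for b in boxes2] for a in boxes1]


giou = pairwise_generalized_iou(
    [(0.0, 0.0, 1.0, 1.0)],
    [(0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)],
)
```

Unlike plain IoU, the value is negative for disjoint boxes and decreases as they move apart, which is what makes it usable as a regression loss.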

detrex.layers.get_sine_pos_embed(pos_tensor: Tensor, num_pos_feats: int = 128, temperature: int = 10000, exchange_xy: bool = True) → Tensor[source]

Generate sine position embedding from a position tensor.

Parameters

pos_tensor (torch.Tensor) – Shape as (None, n).
num_pos_feats (int) – The projected dimension for each float in the tensor. Default: 128.
temperature (int) – The temperature used for scaling the position embedding. Default: 10000.
exchange_xy (bool, optional) – Whether to exchange pos x and pos y. For example, if the input tensor is [x, y], the result will be [pos(y), pos(x)]. Default: True.

Returns

Returned position embedding with shape (None, n * num_pos_feats).

Return type

torch.Tensor
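The shape contract (n floats in, n * num_pos_feats features out) and the effect of exchange_xy can be illustrated with a plain-Python sketch for a single position vector. Two details are assumptions and not stated above: the coordinate is scaled by 2*pi before embedding, and sin and cos features are interleaved. The function names are hypothetical.

```python
import math


def sine_embed_scalar(value, num_pos_feats=128, temperature=10000):
    """Embed one float as num_pos_feats interleaved sin/cos features."""
    scaled = value * 2 * math.pi  # assumed 2*pi scaling of the coordinate
    feats = []
    for i in range(num_pos_feats):
        # Each consecutive (sin, cos) pair shares one frequency.
        dim_t = temperature ** (2 * (i // 2) / num_pos_feats)
        angle = scaled / dim_t
        feats.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return feats


def sine_pos_embed(position, num_pos_feats=128, temperature=10000, exchange_xy=True):
    """Embed an n-vector of positions into n * num_pos_feats features."""
    parts = [sine_embed_scalar(v, num_pos_feats, temperature) for v in position]
    if exchange_xy and len(parts) >= 2:
        # [x, y] input yields [pos(y), pos(x)] output.
        parts[0], parts[1] = parts[1], parts[0]
    return [f for part in parts for f in part]


# x = 0.25, y = 0.0; with exchange_xy the y features come first.
embed = sine_pos_embed([0.25, 0.0], num_pos_feats=4)
```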

detrex.layers.masks_to_boxes(masks) → Tensor[source]

Compute the bounding boxes around the provided masks

The masks should be in format [N, H, W] where N is the number of masks, (H, W) are the spatial dimensions.

Returns: a [N, 4] tensor with the boxes in (x0, y0, x1, y1) format.
Return type: torch.Tensor
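For one mask, the box is given by the smallest and largest row and column indices that contain foreground pixels. The sketch below uses nested lists and a hypothetical name, mask_to_box. It assumes that x1 and y1 are inclusive pixel indices and that the mask is non-empty; how the library handles empty masks is not covered here.

```python
def mask_to_box(mask):
    """Return (x0, y0, x1, y1) for one H x W mask given as nested lists.

    Assumes coordinates are pixel indices (x1 and y1 are inclusive) and that
    the mask has at least one non-zero pixel.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("mask has no foreground pixels")
    return (min(xs), min(ys), max(xs), max(ys))


masks = [
    [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]],
    [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]],
]
boxes = [mask_to_box(m) for m in masks]
```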