Tongue image quality assessment based on a deep convolutional neural network

Table 1 ResNet-152 for tongue image quality control

Layers	Feature map size	Structure
Conv1	200 × 200	7 × 7 conv, 64, stride 2
Conv2_x	100 × 100	3 × 3 max pool, stride 2
Conv2_x	100 × 100	\(\left[ \begin{array}{lll} 1 \times 1 & \text{conv}, & 64 \\ 3 \times 3 & \text{conv}, & 64 \\ 1 \times 1 & \text{conv}, & 256 \end{array} \right] \times 3\)
Conv3_x	50 × 50	\(\left[ \begin{array}{lll} 1 \times 1 & \text{conv}, & 128 \\ 3 \times 3 & \text{conv}, & 128 \\ 1 \times 1 & \text{conv}, & 512 \end{array} \right] \times 8\)
Conv4_x	25 × 25	\(\left[ \begin{array}{lll} 1 \times 1 & \text{conv}, & 256 \\ 3 \times 3 & \text{conv}, & 256 \\ 1 \times 1 & \text{conv}, & 1024 \end{array} \right] \times 36\)
Conv5_x	13 × 13	\(\left[ \begin{array}{lll} 1 \times 1 & \text{conv}, & 512 \\ 3 \times 3 & \text{conv}, & 512 \\ 1 \times 1 & \text{conv}, & 2048 \end{array} \right] \times 3\)
Classification Layer	1 × 1	13 × 13 global average pool
		2208D fully connected layer with ReLU
		2D fully connected layer
		Softmax

Building blocks are shown in brackets, followed by the number of blocks stacked. Downsampling is performed by Conv3_1, Conv4_1, and Conv5_1 with a stride of 2.
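
The feature map sizes in Table 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions the table does not state: a 400 × 400 input (inferred from the 200 × 200 map after the stride-2 Conv1) and the usual ResNet padding values (3 for the 7 × 7 convolution, 1 for the 3 × 3 max pool and the stride-2 convolutions). The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumption: 400 x 400 input, inferred from the 200 x 200 map after Conv1.
INPUT_SIZE = 400

# Number of bottleneck blocks per stage, taken from Table 1.
BLOCKS = {"Conv2_x": 3, "Conv3_x": 8, "Conv4_x": 36, "Conv5_x": 3}


def feature_map_sizes(input_size=INPUT_SIZE):
    """Trace the spatial size through each stage listed in Table 1."""
    sizes = {}
    # Conv1: 7 x 7 conv, stride 2 (padding 3 is an assumption).
    s = conv_out(input_size, kernel=7, stride=2, padding=3)
    sizes["Conv1"] = s
    # Conv2_x: 3 x 3 max pool, stride 2 (padding 1 is an assumption).
    s = conv_out(s, kernel=3, stride=2, padding=1)
    sizes["Conv2_x"] = s
    # Conv3_1, Conv4_1, Conv5_1 each downsample with stride 2.
    # Modeled here as a padded 3 x 3 conv; a stride-2 1 x 1 conv
    # without padding yields the same sizes.
    for name in ("Conv3_x", "Conv4_x", "Conv5_x"):
        s = conv_out(s, kernel=3, stride=2, padding=1)
        sizes[name] = s
    return sizes


def count_conv_layers(blocks=BLOCKS):
    """Conv1 plus three convolutions per bottleneck block."""
    return 1 + 3 * sum(blocks.values())


if __name__ == "__main__":
    for stage, size in feature_map_sizes().items():
        print(f"{stage}: {size} x {size}")
    print("convolutional layers:", count_conv_layers())
```

Under these assumptions the traced sizes match the Feature map size column, including the 13 × 13 map in Conv5_x that results from halving the odd-sized 25 × 25 map. The convolution count is 1 + 3 × (3 + 8 + 36 + 3) = 151; the conventional name "ResNet-152" adds one fully connected layer to that count.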

ISSN: 1472-6947