
Mask R-CNN for Ship Detection & Segmentation

One of the most exciting applications of deep learning is the ability of machines to understand images. Fei-Fei Li has referred to this as giving machines the “ability to see”. There are four main classes of problems in detection and segmentation, as described in image (a) below.

Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN. Faster R-CNN predicts bounding boxes, and Mask R-CNN essentially adds another branch for predicting an object mask in parallel.

I won’t go into detail on how Mask R-CNN works, but here are the overall steps the approach follows:

Backbone model: a standard convolutional neural network that serves as a feature extractor. For instance, it will turn a 1024x1024x3 image into a 32x32x2048 feature map that serves as input for subsequent layers.

Region Proposal Network (RPN): Using regions defined by as many as 200K anchor boxes, the RPN scans each region and predicts whether or not an object is present. One of the great advantages of the RPN is that it doesn’t scan the actual image; the network scans the feature map, which makes it much faster.

Region of Interest Classification and Bounding Box: In this step the algorithm takes the regions of interest proposed by the RPN as inputs and outputs a classification (softmax) and a bounding box (regressor).

Segmentation Masks: In the final step, the algorithm takes the positive ROI regions as inputs and generates 28×28 pixel masks with float values as outputs for the objects. During inference, these masks are scaled up to the size of each ROI’s bounding box.
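To make the last step concrete, here is a minimal NumPy sketch of how a 28x28 soft mask could become a full-size binary mask: resize it to the ROI’s bounding box, threshold it, and paste it into an empty image-sized array. The function name unmold_mask_sketch and the 0.5 threshold are assumptions, and the nearest-neighbor resize is a dependency-free stand-in for the smoother interpolation a real implementation would likely use.

```python
import numpy as np

def unmold_mask_sketch(soft_mask, box, image_shape, threshold=0.5):
    """Scale a small float mask up to its bounding box and paste it into a full-size mask.

    soft_mask: (28, 28) float array of per-pixel scores in [0, 1]
    box: (y1, x1, y2, x2) integer pixel coordinates of the ROI
    image_shape: (height, width) of the original image
    """
    y1, x1, y2, x2 = box
    box_h, box_w = y2 - y1, x2 - x1
    # Nearest-neighbor resize: map every box pixel back to a pixel of the small mask.
    rows = np.arange(box_h) * soft_mask.shape[0] // box_h
    cols = np.arange(box_w) * soft_mask.shape[1] // box_w
    resized = soft_mask[rows[:, None], cols[None, :]]
    # Threshold the soft scores into booleans and place them inside the box.
    full_mask = np.zeros(image_shape, dtype=bool)
    full_mask[y1:y2, x1:x2] = resized >= threshold
    return full_mask

soft = np.zeros((28, 28))
soft[7:21, 7:21] = 0.9  # a confident blob in the middle of the ROI
mask = unmold_mask_sketch(soft, box=(100, 200, 156, 256), image_shape=(768, 768))
print(mask.shape, mask.sum())  # → (768, 768) 784
```

Here the 56x56 box is exactly twice the mask resolution, so the 14x14 blob becomes a 28x28 patch of True pixels in the full image.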

Training and Inference with Mask R-CNN

Instead of replicating the whole algorithm based on the research paper, we’ll use the awesome Mask R-CNN library that Matterport built. We’ll need to A) generate our train and dev sets, B) do some wrangling to load the data into the library, C) set up our training environment in AWS, D) use transfer learning to start training from the COCO pre-trained weights, and E) tune our model to get good results.

Step 1: Download Kaggle Data and Generate Train and Dev Splits

The dataset provided by Kaggle consists of many thousands of images, so the easiest thing is to download them directly onto the AWS machine where we’ll be doing our training. Once we download them, we’ll need to split them into train and dev sets, which can be done randomly with a Python script.

I highly recommend using a Spot Instance to download the data from Kaggle with Kaggle’s API and upload the zipped data to an S3 bucket. You’ll later download that data from S3 and unzip it at training time.

Kaggle provides a CSV file called train_ship_segmentations.csv with two columns: ImageId and EncodedPixels (in run-length encoding format). Assuming we’ve downloaded the images into the ./datasets/train_val/ path, we can split and move the images into train and dev set folders with this code:

import os

import numpy as np
import pandas as pd

train_ship_segmentations_df = pd.read_csv("./datasets/train_val/train_ship_segmentations.csv")

# An image with several ships has several rows, so split on unique image IDs:
# each image then lands in exactly one set (about 80% train, 20% dev).
image_ids = train_ship_segmentations_df["ImageId"].unique()
msk = np.random.rand(len(image_ids)) < 0.8
train_ids = image_ids[msk]
dev_ids = image_ids[~msk]

# Move train set
os.makedirs("./datasets/train", exist_ok=True)
for image_id in train_ids:
    old_path = "./datasets/train_val/{}".format(image_id)
    new_path = "./datasets/train/{}".format(image_id)
    if os.path.isfile(old_path):
        os.rename(old_path, new_path)

# Move dev set
os.makedirs("./datasets/val", exist_ok=True)
for image_id in dev_ids:
    old_path = "./datasets/train_val/{}".format(image_id)
    new_path = "./datasets/val/{}".format(image_id)
    if os.path.isfile(old_path):
        os.rename(old_path, new_path)

Step 2: Load data into the library

There is a specific convention the Mask R-CNN library follows for loading datasets. We need to create a class ShipDataset that implements the main functions required:

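from mrcnn import utils

class ShipDataset(utils.Dataset):
    def load_ship(self, dataset_dir, subset):
        """Register the ship class and the images of a subset ("train" or "val")."""
    def load_mask(self, image_id):
        """Return the instance masks and class IDs for an image."""
    def image_reference(self, image_id):
        """Return a reference to the image, e.g. its path."""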
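Because train_ship_segmentations.csv has one row per ship, something like load_ship needs to collect all the RLE strings that belong to each image before registering it. Below is a minimal pandas sketch of that grouping step, assuming the CSV layout described in Step 1 and assuming that an image without ships appears with an empty EncodedPixels value. group_masks_by_image is a hypothetical helper, not part of the Matterport library.

```python
import io

import pandas as pd

def group_masks_by_image(csv_file):
    """Map each ImageId to the list of RLE strings for its ships (empty list if none)."""
    df = pd.read_csv(csv_file)
    grouped = {}
    for image_id, rle in zip(df["ImageId"], df["EncodedPixels"]):
        ships = grouped.setdefault(image_id, [])
        # An empty EncodedPixels cell is read as NaN and means "no ship in this image".
        if isinstance(rle, str) and rle.strip():
            ships.append(rle)
    return grouped

# Tiny in-memory example in the same two-column format.
example_csv = io.StringIO(
    "ImageId,EncodedPixels\n"
    "a.jpg,1 3 10 2\n"
    "a.jpg,20 4\n"
    "b.jpg,\n"
)
print(group_masks_by_image(example_csv))
# → {'a.jpg': ['1 3 10 2', '20 4'], 'b.jpg': []}
```

Inside load_ship, each entry could then be passed to the library’s add_image method together with the image path, so that load_mask can later decode one mask per RLE string.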

To convert a run-length encoded mask to an image mask (boolean tensor), we use a function called rle_decode. It is used to generate the ground truth masks that we load into the library for training in our ShipDataset class.
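A minimal NumPy sketch of such an rle_decode function follows. It assumes the Kaggle encoding convention: space-separated pairs of 1-based start positions and run lengths, with pixels numbered top to bottom, then left to right (column-major), and a default 768x768 image size. Treat it as an illustration rather than the author’s exact code.

```python
import numpy as np

def rle_decode(mask_rle, shape=(768, 768)):
    """Decode a run-length encoded string into a boolean mask of the given shape.

    mask_rle: "start length start length ..." with 1-based starts.
    Pixels are numbered top to bottom, then left to right (column-major order).
    """
    values = np.array([int(v) for v in mask_rle.split()], dtype=int)
    starts = values[0::2] - 1  # convert to 0-based indices
    lengths = values[1::2]
    flat = np.zeros(shape[0] * shape[1], dtype=bool)
    for start, length in zip(starts, lengths):
        flat[start:start + length] = True
    # order="F" fills column by column, matching the column-major pixel numbering.
    return flat.reshape(shape, order="F")

mask = rle_decode("1 3 10 2", shape=(4, 4))
print(mask.astype(int))
# [[1 0 0 0]
#  [1 0 1 0]
#  [1 0 1 0]
#  [0 0 0 0]]
```

In load_mask, one such mask per ship would be stacked along the last axis to form the [height, width, instance_count] boolean array the library expects.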


Step 3: Set Up Training with P3 Spot Instances and AWS Batch

Given the large dataset we want to train on, we’ll need to use AWS GPU instances to get good results in a practical amount of time. P3 instances are quite expensive, but by using Spot Instances you can get a p3.2xlarge for around $0.90/hr, which represents about 70% savings. The key here is to be efficient and automate as much as we can so as not to waste any time or money on non-training tasks like setting up the data. To do that, we’ll use shell scripts and Docker containers, then use the awesome AWS Batch service to schedule our training.

The first thing I did was create a Deep Learning AMI configured for AWS Batch that uses nvidia-docker, following this AWS guide. The AMI ID is ami-073682d8e65240b76 and it’s open to the community. This allows us to train using Docker containers with GPUs.

Next, we create a Dockerfile that has all of the dependencies we need, as well as the shell scripts that take care of downloading the data and running training. Note the last three shell scripts copied into the container:

FROM nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04

MAINTAINER Gabriel Garza <>

# Essentials: developer tools, build tools, OpenBLAS
RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils git curl vim unzip openssh-client wget \
    build-essential cmake \
    libopenblas-dev

# Python 3.5
# For convenience, alias (but don't sym-link) python & pip to python3 & pip3 as recommended in:
RUN apt-get install -y --no-install-recommends python3.5 python3.5-dev python3-pip python3-tk && \
    pip3 install pip==9.0.3 --upgrade && \
    pip3 install --no-cache-dir --upgrade setuptools && \
    echo "alias python='python3'" >> /root/.bash_aliases && \
    echo "alias pip='pip3'" >> /root/.bash_aliases

# Pillow and its dependencies
RUN apt-get install -y --no-install-recommends libjpeg-dev zlib1g-dev && \
    pip3 --no-cache-dir install Pillow

# Science libraries and other common packages
RUN pip3 --no-cache-dir install \
    numpy scipy scikit-learn scikit-image==0.13.1 pandas matplotlib Cython requests imgaug

# Install AWS CLI
RUN pip3 --no-cache-dir install awscli --upgrade

# Jupyter Notebook
# Allow access from outside the container, and skip trying to open a browser.
# NOTE: disable authentication token for convenience. DON'T DO THIS ON A PUBLIC SERVER.
RUN pip3 --no-cache-dir install jupyter && \
    mkdir /root/.jupyter && \
    echo "c.NotebookApp.ip = '*'" \
         "\nc.NotebookApp.open_browser = False" \
         "\nc.NotebookApp.token = ''" \
         > /root/.jupyter/

# Tensorflow 1.6.0 - GPU
# Install TensorFlow
RUN pip3 --no-cache-dir install tensorflow-gpu==1.6.0

# Expose port for TensorBoard
EXPOSE 6006

# OpenCV 3.4.1
# Dependencies
RUN apt-get install -y --no-install-recommends \
    libjpeg8-dev libtiff5-dev libjasper-dev libpng12-dev \
    libavcodec-dev libavformat-dev libswscale-dev libv4l-dev libgtk2.0-dev \
    liblapacke-dev checkinstall
RUN pip3 install opencv-python

# Keras 2.1.5
RUN pip3 install --no-cache-dir --upgrade h5py pydot_ng keras==2.1.5

# PyCocoTools
# Using a fork of the original that has a fix for Python 3.
# I submitted a PR to the original repo (
# but it doesn't seem to be active anymore.
RUN pip3 install --no-cache-dir git+

COPY /home
COPY /home
COPY /home

WORKDIR "/home"

One script clones our Mask R-CNN repo, downloads and unzips our data from S3, splits the data into train and dev sets, and downloads the latest weights we’ve saved in S3. A second one loads the latest weights, runs the train command python3 ./ train --dataset=./datasets --weights=last, and uploads the trained weights to S3 after training ends. A third one downloads the Kaggle Challenge test dataset (which is used to submit your entry to the challenge), generates predictions for each of the images, converts the masks to run-length encoding, and uploads the predictions CSV file to S3.

Step 3 (continued): Train the model using AWS Batch

The beauty of AWS Batch is that you can create a compute environment that uses Spot Instances, and it will run a job using your Docker container, then terminate your Spot Instance as soon as the job ends.

I won’t go into great detail here (I might make this another post), but essentially you build your image, upload it to AWS ECR, then in AWS Batch you schedule your training or inference job to run with the command bash or bash and wait for it to finish (you can follow the progress by watching the logs in AWS CloudWatch). The resulting files (trained weights or the predictions CSV) are uploaded to S3 by our script.
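As a rough illustration of the scheduling step, a job can also be submitted from Python with the AWS Batch client in boto3, whose submit_job call takes a job name, a job queue, a job definition and optional container overrides. The job name, queue, job definition and script name below are hypothetical placeholders, not the author’s actual resources. The request is built by a plain function so it can be inspected without AWS credentials.

```python
def build_batch_job_request(job_name, job_queue, job_definition, script):
    """Build keyword arguments for boto3's Batch submit_job call.

    All names passed in are placeholders for your own AWS Batch resources.
    """
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Override the container command so the same image can run training or inference.
        "containerOverrides": {"command": ["bash", script]},
    }

def submit(request):
    """Submit the job; requires boto3 and configured AWS credentials."""
    import boto3  # imported here so building requests works without boto3 installed

    batch = boto3.client("batch")
    response = batch.submit_job(**request)
    return response["jobId"]

# Hypothetical resource names, for illustration only.
train_request = build_batch_job_request(
    job_name="ship-train",
    job_queue="p3-spot-queue",
    job_definition="mask-rcnn-ship",
    script="train.sh",
)
print(train_request["containerOverrides"]["command"])  # → ['bash', 'train.sh']
# submit(train_request)  # uncomment to actually launch the job
```

The job queue is what ties the job to the Spot compute environment, so switching between training and inference only changes the script in the command override.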

The first time we train, we pass in the coco argument (in order to use transfer learning and train our model on top of weights already trained on the COCO dataset):

python3 ./ train --dataset=./datasets --weights=coco

Once we’ve finished our initial training run, we’ll pass in the last argument to the train command so we resume training where we left off:

python3 ./ train --dataset=./datasets --weights=last
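One way the --weights argument could be interpreted is sketched below, following the pattern of the Matterport sample training scripts: "coco" loads the COCO weights but skips the output layers whose shapes depend on the number of classes (COCO has 81 classes including background, while ours has 2), and "last" resumes from the most recent checkpoint. The helper names, the checkpoint path and the argument parser are hypothetical; mask_rcnn_coco.h5 is the file name under which Matterport distributes its COCO weights, and the excluded layer names should be checked against your library version.

```python
import argparse

# Output layers whose shapes depend on NUM_CLASSES; they cannot be reused from COCO.
CLASS_SPECIFIC_LAYERS = ["mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train Mask R-CNN to detect ships (sketch).")
    parser.add_argument("command", help="'train'")
    parser.add_argument("--dataset", required=True, help="Root directory of the dataset")
    parser.add_argument("--weights", required=True, help="'coco', 'last', or a path to a .h5 file")
    return parser.parse_args(argv)

def resolve_weights(choice, coco_weights_path, find_last):
    """Return (weights_path, layers_to_exclude) for the --weights option.

    find_last: callable returning the path of the latest saved checkpoint.
    """
    if choice.lower() == "coco":
        # Transfer learning: load every layer except the class-specific output layers.
        return coco_weights_path, CLASS_SPECIFIC_LAYERS
    if choice.lower() == "last":
        # Resume where the previous run stopped: every layer matches, nothing is excluded.
        return find_last(), []
    return choice, []

args = parse_args(["train", "--dataset=./datasets", "--weights=coco"])
path, exclude = resolve_weights(args.weights, "mask_rcnn_coco.h5", find_last=lambda: "logs/latest.h5")
print(path, exclude)
# With the Matterport library these would then be used roughly as:
# model.load_weights(path, by_name=True, exclude=exclude)
```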

We can tune our model using the ShipConfig class, overriding the default settings. Setting the non-max suppression threshold to 0 was important to avoid predicting overlapping ship masks (which the Kaggle challenge doesn’t allow).

from mrcnn.config import Config

class ShipConfig(Config):
    """Configuration for training on the ship dataset.
    Derives from the base Config class and overrides some values.
    """
    # Give the configuration a recognizable name
    NAME = "ship"

    # We use a GPU with 12GB memory, which can fit two images.
    # Adjust down if you use a smaller GPU.
    IMAGES_PER_GPU = 2

    # Number of classes (including background)
    NUM_CLASSES = 1 + 1  # Background + ship

    # Number of training steps per epoch

    # Skip detections with < 95% confidence
    DETECTION_MIN_CONFIDENCE = 0.95

    # Non-maximum suppression threshold for detection
    DETECTION_NMS_THRESHOLD = 0.0

    IMAGE_MAX_DIM = 768
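To see why a non-max suppression threshold of 0 rules out overlapping predictions, here is a plain NumPy sketch of greedy box NMS. It illustrates the technique and is not the library’s own implementation. Boxes use (y1, x1, y2, x2) coordinates. A box is dropped when its IoU with an already kept, higher-scoring box exceeds the threshold, so at 0 any box that overlaps a kept one at all is discarded; since each mask lies inside its box, the surviving masks cannot overlap either.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all as (y1, x1, y2, x2)."""
    y1 = np.maximum(box[0], boxes[:, 0])
    x1 = np.maximum(box[1], boxes[:, 1])
    y2 = np.minimum(box[2], boxes[:, 2])
    x2 = np.minimum(box[3], boxes[:, 3])
    intersection = np.maximum(y2 - y1, 0) * np.maximum(x2 - x1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return intersection / (area + areas - intersection)

def non_max_suppression(boxes, scores, threshold):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Drop every remaining box whose IoU with the kept box exceeds the threshold.
        ious = box_iou(boxes[best], boxes[order[1:]])
        order = order[1:][ious <= threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores, threshold=0.3))  # → [0, 1, 2]
print(non_max_suppression(boxes, scores, threshold=0.0))  # → [0, 2]
```

The first two boxes overlap with an IoU of about 0.14: a typical threshold keeps both, while a threshold of 0 keeps only the higher-scoring one.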

Step 4: Predict ship segmentations

To generate our predictions, all we have to do is run our container in AWS Batch with the bash command. This will use the script inside; here’s a snippet of what inference looks like:

import datetime
import os
import time

import numpy as np
import pandas as pd
import skimage.io

import mrcnn.model as modellib
import ship  # the module that defines ShipConfig and ShipDataset

# ROOT_DIR, MODEL_DIR and SHIP_WEIGHTS_PATH are defined earlier in the script (not shown).
config = ship.ShipConfig()  # training configuration that the inference config extends


class InferenceConfig(config.__class__):
    # Run detection on one image at a time
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    IMAGE_MIN_DIM = 768
    IMAGE_MAX_DIM = 768
    RPN_ANCHOR_SCALES = (64, 96, 128, 256, 512)


# Create model object in inference mode.
config = InferenceConfig()
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)

# Instantiate dataset
dataset = ship.ShipDataset()

# Load weights
model.load_weights(os.path.join(ROOT_DIR, SHIP_WEIGHTS_PATH), by_name=True)
class_names = ['BG', 'ship']

# Run detection
# Load image ids (filenames) and run length encoded pixels
images_path = "datasets/test"
sample_sub_csv = "sample_submission.csv"
# images_path = "datasets/val"
# sample_sub_csv = "val_ship_segmentations.csv"
sample_submission_df = pd.read_csv(os.path.join(images_path, sample_sub_csv))
unique_image_ids = sample_submission_df.ImageId.unique()

out_pred_rows = []
count = 0
for image_id in unique_image_ids:
    image_path = os.path.join(images_path, image_id)
    if os.path.isfile(image_path):
        count += 1
        print("Step: ", count)

        # Start counting prediction time
        tic = time.perf_counter()

        # Load the image (assumes scikit-image, which the Dockerfile installs)
        image = skimage.io.imread(image_path)
        results = model.detect([image], verbose=1)
        r = results[0]

        # Encode each predicted instance mask as an RLE string
        re_encoded_to_rle_list = []
        for i in range(r['masks'].shape[-1]):
            boolean_mask = r['masks'][:, :, i]
            re_encoded_to_rle = dataset.rle_encode(boolean_mask)
            re_encoded_to_rle_list.append(re_encoded_to_rle)

        if len(re_encoded_to_rle_list) == 0:
            # No ship found: still add a row for this image, with empty EncodedPixels
            out_pred_rows += [{'ImageId': image_id, 'EncodedPixels': None}]
            print("Found Ship: ", "NO")
        else:
            # One row per detected ship
            for rle_mask in re_encoded_to_rle_list:
                out_pred_rows += [{'ImageId': image_id, 'EncodedPixels': rle_mask}]
                print("Found Ship: ", rle_mask)

        toc = time.perf_counter()
        print("Prediction time: ", toc - tic)

submission_df = pd.DataFrame(out_pred_rows)[['ImageId', 'EncodedPixels']]
os.makedirs("./submissions", exist_ok=True)
filename = "{}{:%Y%m%dT%H%M}.csv".format("./submissions/submission_", datetime.datetime.now())
submission_df.to_csv(filename, index=False)
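A minimal sketch of an encoder like the rle_encode method called above follows. It is the inverse of the decoding convention assumed for Kaggle’s format (1-based start positions, column-major pixel order) and is written as a standalone function; it is an assumption, not the author’s exact method.

```python
import numpy as np

def rle_encode(mask):
    """Encode a 2-D boolean mask as 'start length ...' with 1-based, column-major starts."""
    # Flatten column by column to match the Kaggle pixel numbering.
    pixels = mask.flatten(order="F").astype(np.int8)
    # Pad with zeros so runs touching the first or last pixel still produce boundaries.
    pixels = np.concatenate([[0], pixels, [0]])
    # Positions where the value changes; +1 turns them into 1-based pixel numbers.
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    # runs alternates start, end; convert each end into a run length.
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)

mask = np.zeros((4, 4), dtype=bool)
mask[0:3, 0] = True  # pixels 1-3 in column-major numbering
mask[1:3, 2] = True  # pixels 10-11
print(rle_encode(mask))  # → 1 3 10 2
```

An empty mask encodes to an empty string, which is why the inference loop only emits rows for masks that were actually predicted.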

I saw several challenging cases, like waves and clouds in the images, which the model initially thought were ships. To overcome this challenge, I modified the region proposal network’s anchor box sizes (RPN_ANCHOR_SCALES) to be smaller; this dramatically improved results because the model no longer predicted small waves to be ships.
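To see what RPN_ANCHOR_SCALES controls, the sketch below generates anchors the way feature-pyramid-based RPNs typically do: one scale per pyramid level, a few aspect ratios with equal area, and one set of anchors centered on every feature-map cell. The strides (4, 8, 16, 32, 64), ratios (0.5, 1, 2) and default scales (32, 64, 128, 256, 512) are assumptions based on the Matterport library’s defaults, so check them against your own config. With a 1024x1024 input this gives 261,888 anchors, in line with the “as many as 200K” figure quoted for the RPN; at the 768x768 resolution used here it is 147,312.

```python
import numpy as np

def level_anchors(scale, ratios, feature_size, stride):
    """Anchors (y1, x1, y2, x2) for one pyramid level with a square feature map."""
    ratios = np.asarray(ratios, dtype=float)
    # Same area for every ratio: height = scale / sqrt(r), width = scale * sqrt(r).
    heights = scale / np.sqrt(ratios)
    widths = scale * np.sqrt(ratios)
    # One anchor center per feature-map cell, mapped back to image pixels.
    centers = np.arange(feature_size) * stride
    cy, cx, h = np.meshgrid(centers, centers, heights, indexing="ij")
    _, _, w = np.meshgrid(centers, centers, widths, indexing="ij")
    boxes = np.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], axis=-1)
    return boxes.reshape(-1, 4)

def pyramid_anchors(image_size, scales=(32, 64, 128, 256, 512),
                    ratios=(0.5, 1, 2), strides=(4, 8, 16, 32, 64)):
    """Concatenate the anchors of every level; smaller scales pair with finer levels."""
    levels = [level_anchors(scale, ratios, int(np.ceil(image_size / stride)), stride)
              for scale, stride in zip(scales, strides)]
    return np.concatenate(levels, axis=0)

print(len(pyramid_anchors(1024)))  # → 261888
print(len(pyramid_anchors(768)))   # → 147312
```

Changing the scales tuple leaves the number of anchors unchanged but changes the size of the boxes proposed on each level, which is what decides which object sizes the RPN is biased toward.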


You can get decent results after about 30 epochs (defined in the training script). I trained for 160 epochs and was able to reach 80.5% accuracy in my Kaggle submission.

I’ve included a Jupyter Notebook called inspect_shyp_model.ipynb that lets you run the model and make predictions on any image locally on your computer.

