Building footprint extraction in a dense area with MaskRCNN — Jakarta, Indonesia

6 min read · Sep 10, 2020

Building footprints are a good indicator of population density across an area.

Background story

If you own a telecommunications company and are trying to find areas in which to promote your products, which area suits you best? This is a market sizing problem. With your customers’ location data, you can map where your customers are. But how do you know whether there are still potential customers in that area?

To answer this, people commonly use data from the Central Bureau of Statistics (BPS) for a specific administrative area, such as a Kabupaten/Kota (district) or Kecamatan (sub-district), to estimate how many potential customers remain.

Central Bureau of Statistics in Indonesia [2]

For example, suppose we have 1000 customers in district A, and BPS tells us that 6000 people live in district A. We can approximate the number of potential customers with simple math. Assume one house equals one customer. In Indonesia, it is common for one household to consist of 4 people or more. Dividing 6000 by 4 gives 1500 houses, which means about 500 potential customers have not yet subscribed to our products.
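The same back-of-the-envelope estimate can be written in a few lines of Python. This is only a sketch of the arithmetic above; the function name is made up for illustration, and the figure of exactly 4 people per house is the simplifying assumption from the example.

```python
def potential_customers(population, people_per_house, current_customers):
    """Estimate untapped customers, assuming one house equals one customer."""
    houses = population // people_per_house  # estimated number of houses
    # Never report a negative number if we already cover every house.
    return max(houses - current_customers, 0)

# District A from the example: 6000 people, 4 per house, 1000 customers.
print(potential_customers(6000, 4, 1000))  # 500
```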

Building footprint extraction

Because the estimate above is still too rough, we can get a better one by using building footprints. A building footprint is simply a polygon that outlines a building. To obtain building footprints, we digitize a polygon over each building in satellite imagery and repeat the process for all visible buildings in the target area. As a result, we know the number of houses in a specific area. Frankly, this takes a lot of time: one person can digitize only about 0.15 square kilometers in 3 hours.

A building footprint is a polygon that surrounds a building.

Mask R-CNN

Because manual digitization takes so much time, in this article we propose a way to speed up the digitization process. We will see how a deep learning model, Mask R-CNN, can help solve this problem.

It is common to use object detection algorithms like the Single Shot Detector (SSD) to detect building footprints. But this has a disadvantage in a dense area like Jakarta: several adjacent buildings may be detected as one object, which hurts accuracy. So, in this article, we propose the Mask R-CNN algorithm.

Mask R-CNN is a state-of-the-art model for instance segmentation. It is built on top of Faster R-CNN, a region-based CNN. For each detected object, it returns a bounding box, a class label, a confidence score, and a pixel-level mask. It has been shown to distinguish objects that are attached to each other. [4]

We will implement the algorithm with the arcgis.learn Python library and ArcGIS Pro.
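To see why a per-object mask helps in a dense area, consider the toy example below. It does not reproduce how SSD or Mask R-CNN work internally. It only shows, on a hypothetical pixel grid with two made-up L-shaped buildings, that the bounding boxes of neighbouring buildings can overlap even when the buildings themselves share no pixels.

```python
def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a set of (row, col) pixels."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)

def box_area(box):
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def box_iou(a, b):
    """Intersection over union of two inclusive pixel bounding boxes."""
    height = min(a[2], b[2]) - max(a[0], b[0]) + 1
    width = min(a[3], b[3]) - max(a[1], b[1]) + 1
    inter = max(0, height) * max(0, width)
    return inter / (box_area(a) + box_area(b) - inter)

def mask_iou(a, b):
    """Intersection over union of two pixel sets."""
    return len(a & b) / len(a | b)

# Two hypothetical L-shaped buildings facing each other.
building_a = {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)}
building_b = {(0, 2), (0, 3), (0, 4), (1, 4), (2, 4)}

print(box_iou(bounding_box(building_a), bounding_box(building_b)))  # boxes overlap
print(mask_iou(building_a, building_b))  # masks do not
```

The boxes overlap in one column of the grid while the masks are disjoint, which is the kind of situation where a mask per building keeps neighbours apart.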

Step 1: Create training samples
Creating training samples means manually digitizing the building footprints in a certain area. In addition, we need to add a column (the classvalue column in Fig. 1) that indicates the class of each footprint. In this case, we have one class, representing a building footprint, and we give it the value 1. The ground truth covers an area of 0.3 square kilometers.

Step 2: Export training data
We will use the World Imagery template from Esri to generate the training data. With the help of the “Export Training Data for Deep Learning” geoprocessing tool, we will have the following items inside the destination folder:

images -> a collection of image chips cut from the raster at the tile size (X and Y) set in the geoprocessing tool
labels -> a collection of label chips matching the image chips, cut at the same tile size
esri_accumulated_stats.json
esri_model_definition.emd
map.txt
stats.txt
models -> not produced by this step; it is explained in step 3
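Before moving on to training, it can be handy to confirm that the export produced everything listed above. The helper below is a small sketch using only the standard library; the function name is hypothetical, and the item names are taken from the list above.

```python
from pathlib import Path

# Items listed above for the export destination folder.
EXPECTED_ITEMS = [
    "images",
    "labels",
    "esri_accumulated_stats.json",
    "esri_model_definition.emd",
    "map.txt",
    "stats.txt",
]

def missing_export_items(folder):
    """Return the expected items that are not present in the folder."""
    folder = Path(folder)
    return [name for name in EXPECTED_ITEMS if not (folder / name).exists()]

# Usage (the path is a placeholder for your own destination folder):
# print(missing_export_items(r"C:\data\export_folder"))
```

An empty list means the folder has the expected layout. The models folder is left out on purpose because it only appears after step 3.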

Geoprocessing tool with its destination folder

Step 3: Train model
Next, we train a model on the images and labels generated in the previous step. Set chip_size to the same value as the tile size. To stay within the GPU’s available VRAM, tune batch_size until you find the largest value that works. If batch_size is set too high, training will fail because it requests more VRAM than the GPU can provide.*

*We used an RTX 2080 Ti, which has 11 GB of VRAM.

For this, we use the arcgis Python library. It is an open-source library managed by Esri, intended to manage and automate ArcGIS workflows with Python. With its additional deep learning module, arcgis.learn, we can train our own deep learning model. A key advantage of this library is that it is very simple to use. The code below shows how we can train a MaskRCNN model in 8 steps.
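One simple way to tune batch_size is to start high and halve it whenever the GPU runs out of memory. The sketch below shows that idea with a stand-in training function, so it runs without a GPU. Both function names are hypothetical, and the check on the error message assumes the usual PyTorch behaviour of raising a RuntimeError that mentions "out of memory".

```python
def largest_working_batch_size(try_training, start=64, minimum=1):
    """Halve the batch size until try_training(batch_size) no longer runs out of memory."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_training(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError("even the minimum batch size does not fit in memory")

def fake_training(batch_size):
    # Stand-in for a real training call; pretends only 8 chips fit in VRAM.
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory")

print(largest_working_batch_size(fake_training))  # 8
```

In practice, try_training would wrap a short training run at the given batch size. Because the search only tries halved values, the result is the largest working value among those tried, not necessarily the largest possible one.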

Python code for training Mask R-CNN model (1)

Python code for training Mask-RCNN model (2)
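For readers who want something to copy, here is a rough sketch of what such a training run can look like with arcgis.learn. It is not a transcription of the code in the figures and does not try to match its steps one by one. The function name and its parameters are chosen for illustration, and the values in the usage comment (chip size, batch size, number of epochs) are assumptions rather than the settings used for this article.

```python
def train_building_model(data_dir, chip_size, batch_size, epochs, model_name):
    """Sketch of a Mask R-CNN training run with arcgis.learn."""
    # Imported here so the function can be defined without arcgis installed.
    from arcgis.learn import MaskRCNN, prepare_data

    # Load the exported image chips and labels from the step 2 folder.
    data = prepare_data(data_dir, chip_size=chip_size, batch_size=batch_size)
    # Look at a few chips with their masks to confirm the export is sane.
    data.show_batch()
    # Create the model with the library's default backbone.
    model = MaskRCNN(data)
    # Ask the library to suggest a learning rate.
    lr = model.lr_find()
    # Train for the requested number of epochs.
    model.fit(epochs, lr=lr)
    # Compare predictions against ground truth on validation chips.
    model.show_results()
    # Save the trained model under the given name.
    model.save(model_name)
    return model

# Usage (all values below are assumptions; adjust them to your data and GPU):
# train_building_model(r"<step 2 destination folder>", chip_size=400,
#                      batch_size=4, epochs=40,
#                      model_name="E40_Jakarta_World_Imagery")
```

Running it requires the arcgis package with its deep learning dependencies and a folder produced by step 2.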

The model built by the Python code above is saved in a folder named ‘models’ inside the step 2 destination folder. In the figure below, the .pth extension of E40_Jakarta_World_Imagery.pth indicates that the framework used by the arcgis Python library is PyTorch. To see more detail about the model, we can open E40_Jakarta_World_Imagery.emd. Opening it with Notepad++ shows that the backbone used is ResNet50 and that the optimal learning rate for training was determined automatically by the library.
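Because an .emd file is JSON text, it can also be inspected from Python instead of a text editor. The sketch below is a minimal example; the function name is hypothetical, and the default key names are assumptions about what the file contains, so check them against your own .emd. Keys that are absent come back as None.

```python
import json

def summarize_emd(emd_path, keys=("ModelName", "Framework", "LearningRate", "ModelParameters")):
    """Return selected top-level entries of an .emd file (JSON text)."""
    with open(emd_path, encoding="utf-8") as f:
        emd = json.load(f)
    # The default key names are assumptions; missing keys map to None.
    return {key: emd.get(key) for key in keys}

# Usage (path relative to the step 2 destination folder, shown as a placeholder):
# print(summarize_emd(r"models\E40_Jakarta_World_Imagery\E40_Jakarta_World_Imagery.emd"))
```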

Step 4: Detect objects with deep learning
Now that we have the model, we can apply it to the imagery with the “Detect Objects Using Deep Learning” geoprocessing tool. In this case, the target is a densely populated area. In the figure below, we can see that our model successfully detects building footprints across Jakarta. Detecting the whole area took 1.5 minutes.

Step 5: Accuracy Assessment
To estimate the accuracy, we can simply divide the number of detected building footprints by the number of ground truth footprints. In the figure below, there are 925 ground truth footprints and 689 detected footprints, so the accuracy is 689/925 * 100% ≈ 74.49%.
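The same ratio can be computed in Python. With the counts quoted above, 689 detected against 925 ground truth footprints, it works out to about 74.49%. The function name is made up for illustration. Keep in mind that this is a count-based measure only: it does not check whether each detected polygon actually overlaps a real building.

```python
def detection_ratio(detected, ground_truth):
    """Detected footprints as a percentage of ground truth footprints (by count)."""
    return detected / ground_truth * 100

print(f"{detection_ratio(689, 925):.2f}%")  # 74.49%
```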

Detection Results — Video

Conclusion
Finishing step 5 means we have learned how to build a deep learning model with the Mask R-CNN algorithm, test the model, and assess the result. The accuracy is quite satisfying for imagery with a 30 cm resolution.

With the steps above, we have seen how deep learning can help companies digitize building footprints automatically. For our test area, it cuts the digitizing time from 6 hours to 1.5 minutes, a great improvement in terms of time. This helps companies do market sizing by counting the number of buildings in a specific area.

Hopefully, this article opens our eyes to the capability of ArcGIS Pro and its arcgis Python library to help people from non-IT backgrounds build their own deep learning models.
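The 6-hour figure follows from the manual rate quoted earlier (0.15 square kilometers in 3 hours) applied to the 0.3 square kilometer ground truth area. A quick sketch of that arithmetic, with a hypothetical function name, also gives the rough speed-up for the detection step alone:

```python
def manual_hours(area_km2, km2_per_session=0.15, hours_per_session=3):
    """Hours needed to digitize an area by hand at the rate quoted in the article."""
    return area_km2 / km2_per_session * hours_per_session

hours = manual_hours(0.3)    # manual digitizing time for the ground truth area
speedup = hours * 60 / 1.5   # compared with 1.5 minutes of automated detection
print(f"{hours:.1f} h manual vs 1.5 min automated: about {speedup:.0f}x faster")
```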

References

[1] https://www.caliper.com/graphics/maptitude-sample-market-potential-map.jpg
[2] https://cdn.ayobandung.com/images-bandung/post/articles/2019/02/12/44916/logo-bps-bps.go.id_ratio-16x9.jpg
[3] https://buildingfootprintusa.com/wp-content/uploads/2019/09/Canada_Montreal.jpg
[4] https://developers.arcgis.com/python/guide/how-maskrcnn-works/
[5] https://arxiv.org/pdf/1703.06870.pdf
[6] https://developers.arcgis.com/python/api-reference/arcgis.learn.html?highlight=unet#

Written by Lucas Suryana