Overcoming data processing overload in scientific web mapping software
Processing large spatial data sets stretched the limits of Manaaki Whenua - Landcare Research’s online web mapping software.
Working with NeSI research software engineers to use virtual containers for lighter, more versatile data processing.
Manaaki Whenua’s web mapping tools infrastructure is now more efficient and able to work with more data than in the past.
Manaaki Whenua - Landcare Research provides a set of online mapping tools to give users information on New Zealand’s land environment. They are important tools for government, business, scientists, and the public. The tools provide information on soil, land resources, land cover and environmental protection.
Michael Speth is a DevOps engineer for Manaaki Whenua - Landcare Research. He supports scientists and application developers in the provision of the infrastructure that hosts the tools. This infrastructure gives end users fast, responsive information about New Zealand’s land.
“Our end users have access to S-Map Online, that’s our soil mapping site with 17,000 registered users. We have another site called OurEnvironment that is a land atlas of New Zealand which was used 22,000 times in 2010. We’re rebuilding other tools and creating new ones,” Michael said.
The maps displayed in the different tools are created from geospatial databases, split into map tiles and stored as massive map tile caches totalling 3.4 Terabytes of data. The efficient assembly of these tiles has proved a challenge for Manaaki Whenua. The multi-resolution tiles are created from large, complex data sets.
“We use Apache server and Mapserver’s Mapcache module as the core map technology. When you view our maps, the web server doesn’t generate new map tiles on the fly. It looks up where the tile is and retrieves it from the cache which is much faster,” Michael said.
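The lookup Michael describes can be pictured with a minimal sketch. It assumes a simple on-disk layout of layer/zoom/column/row.png, which is a hypothetical convention for illustration and not necessarily how Manaaki Whenua's MapCache instance stores its tiles. The function names `tile_path` and `get_tile` are also made up for this example.

```python
from pathlib import Path
from typing import Optional


def tile_path(cache_root: Path, layer: str, z: int, x: int, y: int) -> Path:
    """Map a tile address to a file path (hypothetical layer/z/x/y.png layout)."""
    return cache_root / layer / str(z) / str(x) / f"{y}.png"


def get_tile(cache_root: Path, layer: str, z: int, x: int, y: int) -> Optional[bytes]:
    """Return the pre-rendered tile bytes, or None if that tile was never seeded.

    Nothing is rendered here: serving a tile is only a file lookup,
    which is why a fully seeded cache responds quickly.
    """
    path = tile_path(cache_root, layer, z, x, y)
    if path.is_file():
        return path.read_bytes()
    return None
```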
But creating map tiles this way led to memory and processing capacity problems as the workload increased. More data layers have been added to the tools, increasing the need for updates and tile cache creation.
“How do we speed up generating the map caches? Traditionally, we were running this on a single virtual machine and adding more resources to it. To create the map caches for one of our tools would take about a month of continuous processing,” Michael said.
Michael realised this method wouldn’t provide enough memory to continue adding new data sets to the mapping tools. So the team turned to NeSI's Consultancy Service for guidance.
“There’s only so much CPU and memory you can assign to a virtual machine. We went to NeSI to see if it was possible to parallelise the data processing on the cluster. We wanted to take advantage of Mahuika’s 8000 CPUs and Terabytes of memory. We wanted to take the cache generation we had deployed on the virtual machine and mesh that with NeSI,” Michael said.
Michael was using virtual machines. Virtual machines are software that run their own operating system, mimicking a real hardware machine. It seemed a natural step to harness NeSI processing power to speed up the compiling process. But this meant Michael and his colleagues needed to recompile all the existing mapping software used to generate the maps and tiles. This was an inefficient way to use the software on the cluster.
Instead Michael worked with Wolfgang Hayek, one of the NeSI team’s research software engineers, to find a better solution. Wolfgang and NeSI Solutions Manager Blair Bethwaite discussed options, and the idea of containers surfaced.
Containers are like virtual machines, except they are much lighter weight and share the core of the operating system with the host computer. And because a container bundles up all the software needed for the processing workflow, parallelisation can be achieved simply by running multiple containers concurrently on subsets of the data.
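As a rough illustration of running containers concurrently on subsets of the data, the sketch below splits a map extent into sub-extents and builds one container command per piece. It is not the team's actual pipeline: the image name, config file, tileset name and the way the data is split are all assumptions, and the `mapcache_seed` options shown (`-c`, `-t`, `-z`, `-e`) should be checked against the MapCache documentation before use.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def split_extent(minx, miny, maxx, maxy, nx, ny):
    """Split a bounding box into nx * ny equal sub-extents."""
    dx = (maxx - minx) / nx
    dy = (maxy - miny) / ny
    return [
        (minx + i * dx, miny + j * dy, minx + (i + 1) * dx, miny + (j + 1) * dy)
        for j in range(ny)
        for i in range(nx)
    ]


def build_command(image, config, tileset, extent, zoom):
    """Build one container invocation that seeds tiles for a single sub-extent.

    The image, config and tileset values are placeholders supplied by the caller.
    """
    return [
        "singularity", "exec", image,
        "mapcache_seed",
        "-c", config,
        "-t", tileset,
        "-z", f"{zoom[0]},{zoom[1]}",
        "-e", ",".join(str(v) for v in extent),
    ]


def seed_all(commands, max_workers=4, runner=subprocess.run):
    """Run every command concurrently and return the results in input order.

    `runner` can be swapped for a stub so the sketch can be dry-run
    on a machine without Singularity installed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

For a dry run, `seed_all(commands, runner=print)` prints each command instead of executing it. On a shared cluster, each command would more likely be submitted as its own scheduler job than launched from a local thread pool.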
“Our orchestration tool already installs into a VM, so we thought we’d try and get it installed into a container. The process for that was installing Docker, then converting the components from Docker to Singularity,” Michael said.
NeSI was already using the Singularity containerisation software, so its team was familiar with it. This proved to be a big help to Michael, who had not used it previously. He was able to work with NeSI team members to troubleshoot his problems.
“My first attempt at Singularity failed. There were several problems, partly due to my ignorance of Singularity. Wolfgang looked at what I had done and was able to convert it to a workable solution. He had wonderful patience and provided a lot of explanation about the interworking of Singularity,” Michael said.
Michael took the containerisation knowledge learned from NeSI and applied it to Manaaki Whenua’s problem. New map caches are now created much faster and more efficiently: creating soil map caches for S-Map Online takes three hours when previously it took eight days. Staff are now looking to use this pipeline to create map caches for datasets that are updated weekly. This would not be possible using the old map cache creation pipeline or without the assistance of NeSI expertise.
For more information on this project, the video below includes a presentation by Michael as part of a lightning talk session at FOSS4G SotM Oceania 2019, organised by OSGeo Oceania and held at The National Library in Wellington, New Zealand from November 12-15, 2019.
Do you have an example of how NeSI support or platforms have supported your work? We’re always looking for projects to feature as a case study. Get in touch by emailing firstname.lastname@example.org.