17/11/2023

Dev blog: The difficulty of working with many data sources

Development blog / Smári

Recently we’ve seen a lot of seismic activity in and around Grindavík, Iceland. Ecosophy’s offices are about 40km away, and while we’re not directly affected, we are obviously deeply aware of the events unfolding nearby and eager to see how we might be able to help in communicating available information. The situation is dire: Grindavík has been evacuated, parts of the town have sunk by over a meter, while other areas have lifted up. A magma channel is running through the town and is estimated as being around 15km long. There’s a possibility that things will settle down, but there’s also a chance of a volcanic eruption centered anywhere along the channel, potentially endangering the town. We hope for the best, but we also need to prepare for the worst.

There are several types of Earth Observation Data that are relevant in this kind of situation.

Earthquake data based on triangulation of data from seismographs is the most obvious type. This kind of data is available from a variety of sources, mostly national meteorological or geological offices. The data comes in a ridiculous variety of formats, with a wide range of different details, making the act of combining data from different sources a maddening task. For some reason there doesn’t seem to be any global earthquake data firehose available.
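To give a feel for what combining sources involves, here is a minimal sketch in Python. Both input formats are made up for illustration: the field names (`timestamp`, `size`, `t_ms`, `coords` and so on) are assumptions, not the schema of any real agency's feed. The idea is simply to convert each source into one common record type as early as possible, and only merge after that.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Quake:
    """Common shape for an earthquake event, whatever the source."""
    time: datetime      # always timezone-aware, in UTC
    lat: float          # degrees north
    lon: float          # degrees east
    depth_km: float
    magnitude: float


def from_source_a(rec: dict) -> Quake:
    # Hypothetical source A: ISO 8601 timestamp with offset, depth in km.
    return Quake(
        time=datetime.fromisoformat(rec["timestamp"]).astimezone(timezone.utc),
        lat=float(rec["latitude"]),
        lon=float(rec["longitude"]),
        depth_km=float(rec["depth"]),
        magnitude=float(rec["size"]),
    )


def from_source_b(rec: dict) -> Quake:
    # Hypothetical source B: epoch milliseconds, depth in metres,
    # coordinates packed as [lon, lat].
    lon, lat = rec["coords"]
    return Quake(
        time=datetime.fromtimestamp(rec["t_ms"] / 1000, tz=timezone.utc),
        lat=float(lat),
        lon=float(lon),
        depth_km=rec["depth_m"] / 1000,
        magnitude=float(rec["mag"]),
    )


def merge(*streams):
    """Combine already-normalized events into one time-ordered list."""
    return sorted((q for s in streams for q in s), key=lambda q: q.time)
```

Every new source then costs one small adapter function, and everything downstream only ever sees `Quake` records. The hard part in practice is the details this sketch skips: differing magnitude scales, duplicate events reported by several networks, and revised locations.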

Less obvious but highly relevant is topographic interferometry data. This data is collected by satellites in Earth orbit that have a Synthetic Aperture Radar (SAR) pointed at the ground. There are numerous satellites that provide this kind of data, such as the European Space Agency’s Sentinel-1 mission, the German TerraSAR-X, the Finnish ICEYE satellites, and JAXA’s ALOS-2. In each case, they scan swaths of land as they orbit, typically between 3 and 100 km wide, with resolutions down to 1 meter per pixel. The scans take the form of radar signals reflected from the ground, and from the phase and timing of the returned signal, the ground height can be determined. If you compare data from successive passes, you can see whether the ground height has changed. This can indicate underground magma movement.

The satellites can of course only scan swaths underneath their orbital path. For example, the Sentinel-1 satellites orbit at around 693 km altitude, in a sun-synchronous orbit, so that they’re always over any particular point on Earth’s surface at the same local solar time on each pass. It takes them about 99 minutes to complete one orbit, so they make just under 15 passes of the Earth each day. Because this data comes in these swaths, there’s a lot of preprocessing that needs to be done before the data can be used effectively. This includes calibration adjustments, noise cleanup, and various other steps to make the data uniform.
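The orbit figures are easy to sanity-check with Kepler's third law. The sketch below assumes a circular orbit around a spherical Earth with a mean radius of 6371 km, which is a simplification, and uses the standard value for Earth's gravitational parameter.

```python
import math

MU_EARTH = 398_600.4418   # km^3/s^2, Earth's gravitational parameter
R_EARTH = 6_371.0         # km, mean Earth radius (assumes a spherical Earth)


def orbital_period_minutes(altitude_km: float) -> float:
    """Period of a circular orbit at the given altitude (Kepler's third law)."""
    a = R_EARTH + altitude_km                      # orbit radius in km
    return 2 * math.pi * math.sqrt(a**3 / MU_EARTH) / 60


period = orbital_period_minutes(693)
orbits_per_day = 24 * 60 / period
print(f"{period:.1f} min per orbit, {orbits_per_day:.1f} orbits per day")
```

Under those assumptions, 693 km gives a period of roughly 98.5 minutes and between 14 and 15 orbits per day, which is in line with the numbers above.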

Now, much like with the earthquake data, there is a staggering range of different formats used for this data, with very different levels of complexity. Most of them seem to be based around a very arbitrary set of design decisions that, frankly, make very little sense. But worse still is the fact that the data is difficult to get at. Obtaining Sentinel-1 data requires diving through pages upon pages of documentation with links pointing back and forth between different versions of the documentation. When at last you get to the bottom of that rabbit hole, you’re faced with numerous different APIs providing slightly different views on the data. Most of the time, instead of querying data based on the area of land you’re interested in, you do it based on the relative orbit number of the satellite. Because, as everybody knows, that is the way most people will think about things.
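For contrast, this is roughly what I'd want a query by area to look like. It's a sketch over an in-memory list, not any provider's actual API: the `Scene` records, their IDs, orbit numbers and footprints are invented for the example, footprints are simplified to bounding boxes, and the antimeridian is ignored.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    west: float    # degrees east (negative = west)
    south: float   # degrees north
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.east < other.west or other.east < self.west
            or self.north < other.south or other.north < self.south
        )


class Scene(NamedTuple):
    scene_id: str          # hypothetical identifier
    relative_orbit: int    # hypothetical orbit number
    footprint: BBox


def scenes_over(catalog, aoi):
    """Return the scenes whose footprint overlaps the area of interest."""
    return [s for s in catalog if s.footprint.intersects(aoi)]


# Small box around Grindavík (about 63.84°N, 22.43°W).
aoi = BBox(west=-22.6, south=63.75, east=-22.2, north=63.95)

# Made-up catalog entries, for illustration only.
catalog = [
    Scene("scene-a", 16, BBox(-24.0, 63.0, -20.0, 65.0)),
    Scene("scene-b", 118, BBox(10.0, 50.0, 14.0, 53.0)),
]

for scene in scenes_over(catalog, aoi):
    print(scene.scene_id, scene.relative_orbit)
```

The orbit number is still there for anyone who needs it, but it comes out of the query as a result rather than being the thing you have to know going in.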

Obtaining ALOS-2 data is significantly easier: JAXA’s G-Portal only requires you to register, after which you get easy access to their system. They even have much of their data generally available through an FTP server (deliciously oldschool). Serious kudos to the Japanese for making this easy. However, the data they have available is quite limited, and doesn’t include recent coverage of Iceland.

You might think that the data involved here is enormous and complex. But in reality, it’s only a few thousand entries for any given month per satellite instrument. Keeping things well organized and easily accessible shouldn’t be an insurmountable task, and the fact that it’s really difficult is an impediment to more common usage of this data. I’m sure that even experts who use this data on a regular basis would agree that the current methods aren’t exactly elegant.

At any rate, it turns out that the way to obtain this kind of data is to know the right people in the right organizations, and ideally have a reasonably big budget for acquiring the data from the commercial vendors. Having done so, you can then deal with the weird projections and data normalization issues and maybe, just maybe, end up with something usable.

There are of course various other types of data that would be helpful, such as atmospheric SO2 measurements, soil temperature, and so on. Some of this data is available, but generally not in sufficient fidelity to be useful. Either way, at this point, we’d settle for a reliable stream of earthquake data.

For now, it looks like we’re going to have to hold off on getting proper SAR data feeds, but this is something we’d like to revisit in the future when we have more time to do a deep dive into the jungle of APIs and data formats available for InSAR data.