Caching and local datasets#
Datasets accessed via the intake catalog can either be downloaded into a temporary folder, from which they are deleted when Python is restarted, or permanently into a specified directory. If the dataset is already present in the specified directory, intake loads the data from the local copy instead of downloading it again from the remote server. This is recommended for large datasets or datasets that are used regularly.
The following example shows how to supply local directories to intake using the simplecache functionality. It is suggested to store the directories of multiple local datasets in a single .yaml file, so that the local paths do not have to be specified within the Python routines.
Loading the intake catalog#
We load the intake catalog from ac3airborne, which contains the paths to the remote servers where the files are stored.
import ac3airborne
cat = ac3airborne.get_intake_catalog()
Additionally, we load the flight-phase-separation, which contains information on every research flight.
meta = ac3airborne.get_flight_segments()
Example: Dropsonde data#
The caching functionality will be demonstrated with the dropsonde data published on the PANGAEA database. The location of the dataset on PANGAEA is contained in the intake catalog.
Option 1: Download into temporary folder#
The download into the temporary folder is the default behaviour. Usually the dataset is stored in the /tmp directory. We will download the first dropsonde of ACLOUD_P5_RF23. The parameter i_sonde describes the dropsonde number during the research flight.
ds_dsd = cat['ACLOUD']['P5']['DROPSONDES']['ACLOUD_P5_RF23'](i_sonde=1).to_dask()
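Depending on the file type and backend, the location of the cached file may be recorded in the dataset encoding. A minimal sketch to inspect it (assuming the netCDF backend of xarray, which usually stores the local path under the source key):
# print the local path of the cached file, if the backend recorded it
print(ds_dsd.encoding.get('source', 'not recorded by this backend'))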
By default, the name of the cached file is not human-readable. By setting the parameter same_names of the simplecache group to True and supplying it via the storage_options parameter, the downloaded file gets the same file name as the remote file (i.e. the file on PANGAEA).
kwds = {'simplecache': dict(
same_names=True
)}
ds_dsd = cat['ACLOUD']['P5']['DROPSONDES']['ACLOUD_P5_RF23'](i_sonde=1, storage_options=kwds).to_dask()
Option 2: Permanent download into local directory#
Under the storage_options parameter, we can also specify the local directory of the dataset. The path is supplied to the cache_storage parameter of the simplecache group as shown below. If the remote file is already contained in the local directory, the local file will be read. Otherwise, the remote file will be downloaded and stored permanently at the specified location. The next time the data is imported, intake will use the local file.
Here, we will store the data relative to the current working directory at ./data/dropsondes.
kwds = {'simplecache': dict(
cache_storage='./data/dropsondes',
same_names=True
)}
ds_dsd = cat['ACLOUD']['P5']['DROPSONDES']['ACLOUD_P5_RF23'](i_sonde=1, storage_options=kwds).to_dask()
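To verify the permanent download, we can list the contents of the local directory (a quick check, assuming the cell above has been executed):
import os

# list the files that simplecache stored in the local directory
print(sorted(os.listdir('./data/dropsondes')))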
Managing directories of multiple datasets#
Datasets may be stored locally in different directories. Instead of specifying the directory in every Python script, we can use a single file in which the path of each dataset is stored. Here, we will use a yaml file, as it can easily be read into a Python dictionary.
The file may be structured like this:
DROPSONDES: '/local/path/to/dropsondes'
BROADBAND_IRRADIANCE: '/local/path/to/broadband_irradiance'
FISH_EYE: '/local/path/to/fish_eye'
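Such a file can also be created from a Python dictionary, for example (the paths below are placeholders matching the layout used later in this example):
import yaml

# local paths per dataset; adapt these to your own directory layout
local_dir = {
    'DROPSONDES': './data/dropsondes',
    'BROADBAND_IRRADIANCE': './data/broadband_irradiance',
    'FISH_EYE': './data/fish_eye',
}

# write the dictionary to a yaml file
with open('./local_datasets.yaml', 'w') as f:
    yaml.safe_dump(local_dir, f)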
In the following, the data will be downloaded to or read from the local ./data folder of the current working directory.
import yaml
Now we read the pre-defined yaml file
with open('./local_datasets.yaml', 'r') as f:
    local_dir = yaml.safe_load(f)
print(local_dir)
{'DROPSONDES': './data/dropsondes', 'BROADBAND_IRRADIANCE': './data/broadband_irradiance', 'FISH_EYE': './data/fish_eye'}
As a test, we will download the dropsonde data from ACLOUD RF05.
dataset = 'DROPSONDES'
flight_id = 'ACLOUD_P5_RF05'
We can access the directory, where the data is stored using the dataset name.
print(local_dir[dataset])
./data/dropsondes
Now we add the path to the storage_options parameter.
kwds = {'simplecache': dict(
cache_storage=local_dir[dataset],
same_names=True
)}
Now we download the dropsonde file or use the local copy. Afterwards, we can check whether the directory ./data/dropsondes has been created and contains the file DS_ACLOUD_Flight_05_20170525_V2.nc (see the check below). If the directory and the file already exist, the local file is read.
ds = cat['ACLOUD']['P5'][dataset][flight_id](i_sonde=1, storage_options=kwds).to_dask()
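A simple way to perform this check, using the directory stored in local_dir:
import os

# check that the cache directory exists and list its contents
print(os.path.isdir(local_dir[dataset]))
print(sorted(os.listdir(local_dir[dataset])))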
Get data for offline usage#
The following example presents a way to download the data of a certain instrument for all research flights for offline usage. First, we get all flights for which data of the instrument is available.
dataset = 'DROPSONDES'
flight_ids = []
for campaign in ['ACLOUD', 'AFLUX', 'MOSAiC-ACA']:
    flight_ids.extend(list(cat[campaign]['P5']['MiRAC-A']))
print(flight_ids)
['ACLOUD_P5_RF04', 'ACLOUD_P5_RF05', 'ACLOUD_P5_RF06', 'ACLOUD_P5_RF07', 'ACLOUD_P5_RF08', 'ACLOUD_P5_RF10', 'ACLOUD_P5_RF11', 'ACLOUD_P5_RF15', 'ACLOUD_P5_RF17', 'ACLOUD_P5_RF18', 'ACLOUD_P5_RF19', 'ACLOUD_P5_RF20', 'ACLOUD_P5_RF21', 'ACLOUD_P5_RF22', 'ACLOUD_P5_RF23', 'ACLOUD_P5_RF25', 'AFLUX_P5_RF03', 'AFLUX_P5_RF04', 'AFLUX_P5_RF05', 'AFLUX_P5_RF06', 'AFLUX_P5_RF07', 'AFLUX_P5_RF08', 'AFLUX_P5_RF09', 'AFLUX_P5_RF10', 'AFLUX_P5_RF11', 'AFLUX_P5_RF12', 'AFLUX_P5_RF13', 'AFLUX_P5_RF14', 'AFLUX_P5_RF15', 'MOSAiC-ACA_P5_RF05', 'MOSAiC-ACA_P5_RF06', 'MOSAiC-ACA_P5_RF07', 'MOSAiC-ACA_P5_RF08', 'MOSAiC-ACA_P5_RF09', 'MOSAiC-ACA_P5_RF10', 'MOSAiC-ACA_P5_RF11']
Now we simply loop over all the flights. In the case of the dropsondes, we set the i_sonde parameter to 1, since all dropsondes of a flight are contained in the same file. This file will be downloaded if it is not already present in the local directory. Dropsonde datasets are stored in separate folders for each campaign. The paths to which the data is downloaded are collected in a dictionary and written to a yaml file.
list(cat['ACLOUD']['P5'][dataset])
['ACLOUD_P5_RF05',
'ACLOUD_P5_RF06',
'ACLOUD_P5_RF07',
'ACLOUD_P5_RF10',
'ACLOUD_P5_RF11',
'ACLOUD_P5_RF13',
'ACLOUD_P5_RF14',
'ACLOUD_P5_RF16',
'ACLOUD_P5_RF17',
'ACLOUD_P5_RF18',
'ACLOUD_P5_RF19',
'ACLOUD_P5_RF20',
'ACLOUD_P5_RF21',
'ACLOUD_P5_RF22',
'ACLOUD_P5_RF23']
dct = {dataset: {}}

for flight_id in list(cat['ACLOUD']['P5'][dataset]):
    # get the mission (campaign) name from the flight_id
    mission = flight_id.split('_')[0]

    # define the path to which the data should be downloaded
    path = f'./data/{dataset.lower()}/{mission.lower()}'

    # store the path in a dictionary
    dct[dataset].update({mission: path})

    # define the parameters for caching
    kwds = {'simplecache': dict(
        cache_storage=path,
        same_names=True
    )}

    # read the data once, so that it is downloaded and cached
    cat['ACLOUD']['P5'][dataset][flight_id](i_sonde=1, storage_options=kwds).to_dask()

# keep track of the paths where the data was downloaded to
with open('./local_datasets_2.yaml', 'w') as f:
    yaml.dump(dct, f)
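When working offline later, the stored paths can be read back and reused for the storage_options parameter. A minimal sketch, assuming cat, dataset, and the yaml file written above are still available:
# read the paths written above
with open('./local_datasets_2.yaml', 'r') as f:
    local_paths = yaml.safe_load(f)

# rebuild the caching options for a single flight
flight_id = 'ACLOUD_P5_RF23'
mission = flight_id.split('_')[0]
kwds = {'simplecache': dict(
    cache_storage=local_paths[dataset][mission],
    same_names=True
)}

# intake now reads the previously downloaded local file
ds = cat['ACLOUD']['P5'][dataset][flight_id](i_sonde=1, storage_options=kwds).to_dask()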