diff --git a/README.md b/README.md index b7fa297..81a13b8 100644 --- a/README.md +++ b/README.md @@ -30,22 +30,15 @@ running in Jupyter should open. That's it for the setup. ## The notebooks -* **Get Started.ipynb** -_All about Earthdata Login_ Run this to create a local token object. Must have a -an active EDL account. The token object will be used in other notebooks. +The overall organization is: -* **ECCOv4.ipynb** -_Uses pydap an xarray_ to aggregate an an entire collection of -the ECCOv4 simulation. It uses the _Common Metadata Repository_ (CMR) to find cloud OPeNDAP URLs -associated with the DOI of the data collection. +* **binder/*.ipynb**: These notebooks demonstrate the basic, most performant way to access OPeNDAP data via PyDAP. They cover: + * How to find OPeNDAP URLs. + * How to authenticate. + * How to subset by variable names, time range, and coordinate values. + * How to best stream OPeNDAP DAP4 responses into local NetCDF4 files. -* **earthaccess.ipynb** -_Uses earthaccess to create a VirtualZarr from OPeNDAP DMR++ for Cloud -data_. earthaccess uses the _Common Metadata Repository_ (CMR) to query all dataset by their -short name. - -* **on-premOPeNDAP.ipynb** _Using pydap an xarray_ it creates a virtually aggregation of 100s of -OPeNDAP URLs hosted by NASA on premises on the OB.DAAC. +* **binder/Xarray/*.ipynb**: These tutorials demonstrate access to OPeNDAP data using Xarray with PyDAP as the backend engine. Getting the most performant access requires extra tricks, and these tutorials cover what is needed to get close to that performance. 
---- ## **Optional**: Running the notebooks locally @@ -77,7 +70,7 @@ conda env create -f binder/environment.yml` After a few minutes, activate the new environment ``` -conda activate Earthdata2025` +conda activate Earthdata2026 ``` * Step 3 Starting the local copy of the notebooks diff --git a/binder/Authenticate.ipynb b/binder/Authenticate.ipynb new file mode 100644 index 0000000..0036bee --- /dev/null +++ b/binder/Authenticate.ipynb @@ -0,0 +1,99 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "da20b7f1-ab3a-477c-a9bb-4c3f97c99afb", + "metadata": {}, + "source": [ + " **Earthdata Login authentication with earthaccess**\n", + " \n", + " \n", + "\"drawing\" \n", + " \n", + "\n", + "\n", + " **Requirements**\n", + "1. Valid EDL account. If you do not have one, go to the [Login Page](https://urs.earthdata.nasa.gov/home) and set up a username and password.\n", + "\n", + " **Objectives**\n", + "- To demonstrate remote access via token to Earthdata.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "e477a20a-4586-405d-8dce-9e5cee6e0ab1", + "metadata": {}, + "outputs": [], + "source": [ + "import earthaccess" + ] + }, + { + "cell_type": "markdown", + "id": "5774c511-3c18-4741-970a-bf78682c25a6", + "metadata": {}, + "source": [ + " **EDL Authentication via earthaccess**\n", + "\n", + " You can authenticate via earthaccess as demonstrated below. You must have a valid EDL account. There are two strategies for authenticating with `earthaccess`:\n", + "\n", + "1. `strategy=\"interactive\"`. This will prompt for your EDL username and password.\n", + "2. `strategy=\"netrc\"`. Use this if the notebook is running in an environment where a `.netrc` with your credentials is recoverable.\n", + "\n", + "The cell below defaults to `interactive`, assuming no `.netrc` credentials are stored on the current machine. Once authenticated, all EDL credentials are written to an existing local (discoverable) `.netrc` file, or a new one is created if none exists yet." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42cf01d6-ebed-4028-8e77-63c8dfbdc40e", + "metadata": {}, + "outputs": [], + "source": [ + "auth = earthaccess.login(strategy=\"interactive\", persist=True) # you will be prompted to add your EDL credentials\n", + "\n", + "# pass Token Authorization to a new Session.\n", + "my_session = auth.get_session()" + ] + }, + { + "cell_type": "markdown", + "id": "44bd8c9a-09a4-4187-a814-f371a02b885a", + "metadata": {}, + "source": [ + " The `my_session` object contains all valid credentials and can be used by PyDAP to send data requests to the remote NASA OPeNDAP server." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a4ca32f3-eadc-4f0c-adca-dc5ef06379ca", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/binder/CMR_queries.ipynb b/binder/CMR_queries.ipynb index dbdb6b2..b3ebd66 100644 --- a/binder/CMR_queries.ipynb +++ b/binder/CMR_queries.ipynb @@ -14,103 +14,46 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "id": "adc54031-dc9f-4858-83be-a84c6ee4eef0", "metadata": {}, "outputs": [], "source": [ - "from pydap.net import create_session\n", - "from pydap.client import get_cmr_urls" + "from pydap.client import get_cmr_urls\n", + "import datetime as dt" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "id": "f764fa2b-acc4-43ba-8762-b606dc4a96e4", "metadata": {}, "outputs": [], "source": [ - "ecostress_ccid = \"C2076114664-LPCLOUD\"" + "ECOSTRESS_ccid = 
\"C2076114664-LPCLOUD\"\n", + "bounding_box = [-128.847656,41.112469,-107.050781,46.679594]\n", + "time_range = [dt.datetime(2025, 3, 1), dt.datetime(2025, 3, 31)]" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "id": "c4306a74-3848-42ec-9e57-e396f9f47b80", "metadata": {}, "outputs": [], "source": [ - "urls = get_cmr_urls(ccid=ecostress_ccid, bounding_box=list((-130.8, 41, -124, 45)))" + "urls = get_cmr_urls(ccid=ECOSTRESS_ccid, bounding_box=bounding_box, time_range=time_range, limit=500)\n", + "print(\"Found \", len(urls), \"relevant opendap urls for ECOSTRESS data\")" ] }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "id": "68aa026d-51d5-4bab-a642-2a5c04258712", "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "['https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00152_003_20180716T130457_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00258_001_20180723T101233_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00289_001_20180725T100502_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00346_001_20180729T014458_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00346_002_20180729T014550_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00392_002_20180801T004543_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00392_003_20180801T004635_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00408_005_20180802T013113_0712_04',\n", - " 
'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00423_001_20180803T003844_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00423_002_20180803T003936_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00438_002_20180803T234626_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00438_003_20180803T234718_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00441_002_20180804T043829_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00453_003_20180804T225449_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00457_005_20180805T052214_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00457_005_20180805T052214_0712_05',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00457_006_20180805T052306_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00457_007_20180805T052358_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00469_004_20180805T233925_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00469_004_20180805T233925_0712_05',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00469_005_20180805T234017_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00472_001_20180806T043039_0712_04',\n", - " 
'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00472_002_20180806T043131_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00484_004_20180806T224658_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00484_004_20180806T224658_0712_05',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00484_005_20180806T224750_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00484_006_20180806T224842_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00487_002_20180807T033902_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00499_003_20180807T215521_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00499_004_20180807T215613_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00503_007_20180808T042338_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00503_008_20180808T042430_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00518_007_20180809T033122_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00518_008_20180809T033214_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00530_004_20180809T214742_0712_05',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00530_004_20180809T214742_0712_04',\n", - " 
'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00530_005_20180809T214834_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00530_006_20180809T214926_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00533_005_20180810T023944_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00545_002_20180810T205606_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00545_003_20180810T205658_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00549_011_20180811T032420_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00549_011_20180811T032420_0712_03',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00576_004_20180812T204906_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00576_005_20180812T204958_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00579_008_20180813T013925_0712_03',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00579_008_20180813T013925_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00579_009_20180813T014017_0712_03',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00591_001_20180813T195545_0712_04',\n", - " 'https://opendap.earthdata.nasa.gov/collections/C2076114664-LPCLOUD/granules/ECOv002_L2_LSTE_00591_001_20180813T195545_0712_05']" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } 
- ], + "outputs": [], "source": [ - "urls" + "urls[:10]" ] }, { @@ -138,7 +81,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.13" + "version": "3.12.12" } }, "nbformat": 4, diff --git a/binder/DAYMET_example.ipynb b/binder/DAYMET_example.ipynb new file mode 100644 index 0000000..0070adf --- /dev/null +++ b/binder/DAYMET_example.ipynb @@ -0,0 +1,403 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e4011688-ed19-470b-988b-cdd853a0974e", + "metadata": {}, + "source": [ + " **Accessing DAYMET data**\n", + "\n", + " Here, we are interested in daily data for precipitation.\n", + " \n", + " **Requirements**\n", + "1. EDL authentication (username/password)\n", + "2. **Optional**: If running the notebook locally, install the conda environment defined in the `environment.yml` file.\n", + "\n", + "\n", + " **Objectives**\n", + "\n", + "- Find all relevant **OPeNDAP** URLs.\n", + "- Using OPeNDAP-produced metadata, subset by variable name.\n", + "- Download coordinate data and identify the spatial subset.\n", + "- Stream data with Pydap+OPeNDAP, downloading only the data of interest.\n", + "\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5291bdb6-13c8-4c18-b66e-e4fddb9f3464", + "metadata": {}, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import datetime as dt\n", + "import earthaccess\n", + "import numpy as np\n", + "\n", + "# import pydap-specific tools\n", + "from pydap.client import get_cmr_urls, open_url\n", + "from pydap.client import to_netcdf as dap_to_netcdf" + ] + }, + { + "cell_type": "markdown", + "id": "012b3cdf-dd51-4838-9b89-39a266ada3df", + "metadata": {}, + "source": [ + " **EDL Authentication via earthaccess and OPeNDAP**\n", + "\n", + " You can authenticate via earthaccess as demonstrated below. You must have a valid EDL account. There are two strategies for authenticating with `earthaccess`:\n", + "\n", + "1. 
`strategy=\"interactive\"`. This will prompt for your EDL username and password.\n", + "2. `strategy=\"netrc\"`. Use this if the notebook is running in an environment where a `.netrc` with your credentials is recoverable.\n", + "\n", + "The cell below uses `netrc`; switch to `interactive` if no `.netrc` file with your credentials is available.\n", + "\n", + " Having authenticated, pydap can inherit a requests.Session object from earthaccess, containing all valid credentials. When pydap sends data requests to the NASA OPeNDAP server, these credentials are passed along, and the server verifies authentication when needed before sending data back." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51b125f6-87ac-4bb2-9233-ac912da78d21", + "metadata": {}, + "outputs": [], + "source": [ + "auth = earthaccess.login(strategy=\"netrc\", persist=True) # reads your EDL credentials from a .netrc file\n", + "\n", + "# pass Token Authorization to a new Session.\n", + "my_session = auth.get_session()" + ] + }, + { + "cell_type": "markdown", + "id": "6c659d9d-4c3a-45b2-af2d-3002ee4be062", + "metadata": {}, + "source": [ + "# Finding OPeNDAP URLs\n", + "\n", + " **Query OPeNDAP URLs using NASA's CMR API**\n", + "\n", + " Must provide:\n", + "\n", + "- Time range of interest\n", + "- Concept collection ID of interest.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ff131fc-1d33-4168-a107-f169ef2b1391", + "metadata": {}, + "outputs": [], + "source": [ + "daymet_ccid = \"C2531982907-ORNL_CLOUD\" # \n", + "time_range = [dt.datetime(2014, 5, 15), dt.datetime(2024, 5, 15)] # Ten years of data\n", + "\n", + "cmr_urls = get_cmr_urls(ccid=daymet_ccid, time_range=time_range, limit=1000) # you can increase the limit of results\n", + "\n", + "print(\"################################################ \\n We found a total of \", len(cmr_urls), \"OPeNDAP URLs!!!\\n################################################\")" + ] + }, + { + "cell_type": "markdown", + "id": "4eebcefe-fd6f-4bea-97ab-089a5e89a63f", + "metadata": {}, + "source": 
[ + "\n", + "### Further filtering\n", + "\n", + " The CMR returns all `precipitation` URLs from DAYMET. However, these are further split into three regions:\n", + "\n", + "1. Hawaii\n", + "2. Puerto Rico\n", + "3. North America (Continental US).\n", + "\n", + " We need to further filter these OPeNDAP URLs to retain only the variables of interest. In this tutorial we are only interested in `Continental US`, so we select only the URLs whose granule name contains\n", + "\n", + "* `Daymet_Annual_V4R1.daymet_v4_prcp_annttl_na_`\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "473ad9f4-dd63-4a9a-8991-b1d91df7cba4", + "metadata": {}, + "outputs": [], + "source": [ + "# filter URLs to retain only Annual Precip for North America\n", + "prcp_na_urls = [url for url in cmr_urls if url.split(\".nc\")[0].split(\"Daymet_Annual_V4R1.daymet_v4_\")[-1].startswith(\"prcp_annttl_na\")]\n", + "prcp_na_urls[:5]" + ] + }, + { + "cell_type": "markdown", + "id": "ca568631-d2c5-4904-86eb-25c1d9b1b6fd", + "metadata": {}, + "source": [ + "## Subset by variable and by lat/lon values\n", + "\n", + " We will use pydap to download only the metadata of 1 URL. 
Since this is Level 4 data (model output), we only need to figure out how to subset 1 file, and we can apply that to all files.\n", + "\n", + "### Area of Interest\n", + "\n", + " Mid-Atlantic, the area bounded by \n", + "```python\n", + "bounding_box = [36.5, -80.3, 40.5, -74] # Follows the format: [South, West, North, East]\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "242af998-0247-47f1-ab56-ad2a78dafc07", + "metadata": {}, + "outputs": [], + "source": [ + "pyds = open_url(prcp_na_urls[0], protocol=\"dap4\", session=my_session, batch=True)\n", + "pyds.tree()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "448634d3-ff5f-4467-9064-dbcd823dd055", + "metadata": {}, + "outputs": [], + "source": [ + "ds = xr.open_dataset(prcp_na_urls[0].replace(\"https\", \"dap4\"), session=my_session, engine='pydap')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2f8fbac-c0fb-4e99-9518-52c1acc213f7", + "metadata": {}, + "outputs": [], + "source": [ + "lon_min, lon_max = -80.3, -74\n", + "lat_min, lat_max = 36.5, 40.5" + ] + }, + { + "cell_type": "markdown", + "id": "91145845-d32b-419d-aa76-674757a621eb", + "metadata": {}, + "source": [ + "## Let's download data\n", + " But only the coordinates latitude and longitude" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "adc99194-0c1d-49cc-963d-2a8ba11bd408", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "lon = pyds['lon'][:].data\n", + "lat = pyds[\"lat\"][:].data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "56a40cf0-302f-4ed1-9413-11f8949076e9", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "lon, lat = np.asarray(lon), np.asarray(lat)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "23cc9953-c76a-4ae2-be3a-87238c1434a3", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"######################################## \\n longitude array 
size:\", lon.shape, \"\\n ########################################\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8def1c24-888d-4317-b792-a7d948bb4a22", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"######################################## \\n latitude array size:\", lat.shape, \"\\n ########################################\")" + ] + }, + { + "cell_type": "markdown", + "id": "1e640bfb-4b1e-4d31-9a46-8a9acfa97c6a", + "metadata": {}, + "source": [ + "## Find the indexes that define the area of interest" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "07979f37-7b96-4b15-9cc6-8ec5757ebe4e", + "metadata": {}, + "outputs": [], + "source": [ + "# 1) points that fall inside your lat/lon box\n", + "mask = (\n", + " (lon >= lon_min) & (lon <= lon_max) &\n", + " (lat >= lat_min) & (lat <= lat_max)\n", + ")\n", + "\n", + "rows, cols = np.where(mask)\n", + "# indexes below\n", + "y0, y1 = rows.min(), rows.max()\n", + "x0, x1 = cols.min(), cols.max()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "f8676b37-ea87-46ee-a1f3-a011e2fc1cd2", + "metadata": {}, + "source": [ + "## Double check these values are reasonable" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81503a70-5a82-4f1e-8d64-8249403b8b3b", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"#################################################################################### \\n Data download will span longitude values: [{lon[y0:y1,x0:x1].ravel().min()}\" + \", \" + f\"{lon[y0:y1,x0:x1].ravel().max()}] \\n####################################################################################\" )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "430e19d4-3348-4800-a679-5b8258d92540", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"#################################################################################### \\n Data download will span latitude values: 
[{lat[y0:y1,x0:x1].ravel().min()}\" + \", \" + f\"{lat[y0:y1,x0:x1].ravel().max()}] \\n ####################################################################################\" )" + ] + }, + { + "cell_type": "markdown", + "id": "d7127909-619d-4b00-8433-dc3ff438fab3", + "metadata": {}, + "source": [ + "### Define pydap-specific parameters\n", + "\n", + " These are needed to:\n", + "- Subset close to remote data (so only data of interest is downloaded)\n", + "- Define where to store data in local environment\n", + "\n", + " By default, if no parameter is defined, it will download the entire data and place it in the current directory" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d043588c-1c91-4afc-940e-bcc8e39dc556", + "metadata": {}, + "outputs": [], + "source": [ + "dim_slices = {'/y':(y0,y1), '/x': (x0,x1)} # defines index to subset format: (first, last)\n", + "keep_vars = [\"/time\", \"/y\", \"/x\", \"/lon\", \"/lat\", \"/prcp\"] # variables to download\n", + "output_path = \"data\"" + ] + }, + { + "cell_type": "markdown", + "id": "4c6f33bc-7db3-4edf-8c82-e4a15f4a862d", + "metadata": {}, + "source": [ + "# Stream and deserialize data\n", + "\n", + " Pydap will store each remote file into is own individual file (each file will have the same name as that of the source file), instead of aggregating all the data. 
This is considerably safer (since not all data can be aggregated into a single datacube), and enables parallelism.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "756404a3-f524-445d-82e7-fcbe0cf85ba2", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "dap_to_netcdf(prcp_na_urls, session=my_session, output_path=output_path, dim_slices=dim_slices, keep_variables=keep_vars)" + ] + }, + { + "cell_type": "markdown", + "id": "0bbd1767-075e-4bbd-8dac-38c309425fa0", + "metadata": {}, + "source": [ + "## Inspect data locally\n", + "\n", + " Once the files are on your local system, you can aggregate them if needed. In this case all data can be aggregated, and it is considerably easier to aggregate the files locally than remotely.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0727d165-5120-417e-9319-6d92b759c101", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds = xr.open_mfdataset(\"data/daymet_v4_prcp_annttl_na*\", parallel=True, concat_dim='time', combine='nested')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cedaa044-cf84-49a1-968d-5996ce924cfc", + "metadata": {}, + "outputs": [], + "source": [ + "ds['prcp'].isel(time=1).plot()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e5dedf41-0817-4a6d-ba26-400fdaea0fa8", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/binder/TEMPO_tutorials.ipynb b/binder/TEMPO_tutorials.ipynb index c603934..615427c 100644 --- a/binder/TEMPO_tutorials.ipynb +++ 
b/binder/TEMPO_tutorials.ipynb @@ -5,13 +5,11 @@ "id": "dd1d5518-f9a5-4a7f-a5fc-16d6a7394ee4", "metadata": {}, "source": [ - " **Intro Tutorial Notebook OPeNDAP (with Python)**\n", + " **Access to TEMPO data via OPeNDAP**\n", "\n", " **Requirements**\n", "1. EDL authentication (username/password)\n", - "2. Generate a Bearer Token.\n", - "3. Get Notebook from NASA-tutorial Github repository\n", - "4. Get `environment.yml` file and install conda environment to run notebook.\n", + "2. **Optional**: If running the notebook locally, install the conda environment defined in the `environment.yml` file.\n", "\n", "\n", " **Objectives**\n", @@ -29,9 +27,7 @@ "### Subset multiple remote files\n", "\n", "- **a)** Naive approaches.\n", - "- **b)** Streaming data\n", - "\n", - "## Appendix: Using curl\n" + "- **b)** Streaming data\n" ] }, { @@ -46,8 +42,8 @@ "import earthaccess\n", "\n", "# import pydap-specific tools\n", - "from pydap.net import create_session, extract_session_state\n", - "from pydap.client import get_cmr_urls, consolidate_metadata, stream, stream_parallel, open_url\n" + "from pydap.client import get_cmr_urls, open_url\n", + "from pydap.client import to_netcdf as dap_to_netcdf" ] }, { @@ -86,36 +82,6 @@ "cmr_urls[0]" ] }, - { - "cell_type": "markdown", - "id": "8d9882ef-e8cd-49f8-b627-026047124459", - "metadata": {}, - "source": [ - "# Inspecting Metadata\n", - "\n", - " **Understanding DAP4**\n", - "\n", - "
\n", - "\"drawing\" \n", - "
\n", - "\n", - "\n", - "### Basic Data Exploration\n", - "\n", - "To inspect the metadata of a remote OPeNDAP URL (variables in the file), you :\n", - "\n", - "* Append a `.dmr` at the end of the URL\n", - "* Open on a browser.\n", - "* Useful when interested in subseting by variables for example.\n", - "\n", - "For example:\n", - "\n", - "https://opendap.earthdata.nasa.gov/collections/C2930764281-LARC_CLOUD/granules/TEMPO_O3TOT_L3_V03_20250831T232841Z_S016.nc.dmr\n", - "\n", - "\n", - "\n" - ] - }, { "cell_type": "markdown", "id": "6e04deee-6c59-4437-9aed-381a73085496", @@ -139,10 +105,10 @@ "metadata": {}, "outputs": [], "source": [ - "auth = earthaccess.login(strategy=\"interactive\", persist=True) # you will be promted to add your EDL credentials\n", + "auth = earthaccess.login(strategy=\"netrc\", persist=True) # reads your EDL credentials from a .netrc file\n", "\n", "# pass Token Authorization to a new Session.\n", - "my_session = create_session(session=auth.get_session())" + "my_session = auth.get_session()" ] }, { @@ -150,18 +116,18 @@ "id": "a2322093-b9b6-49c5-8bb4-7ad17bffb95d", "metadata": {}, "source": [ - "# Accessing Metadata\n", + " **Server-Side Subsetting**\n", "\n", - " What are some tools, their differences, and what do they do.\n", + "### Subset by Variable name\n", "\n", - " **PYDAP: Metadata-Only**\n", + " To subset by variable name, we first need to know which variables the remote file contains. Below we access only the metadata of the remote file via PyDAP, reusing the session object.\n", "\n", + " Below we request the DAP4 (hierarchical) metadata from the remote server. To specify the protocol, we have 2 options:\n", "\n", + "1. `url.replace(\"https\",\"dap4\")`. This replaces the `https` with `dap4`. \"dap4\" is not a separate protocol from https; it is simply a way to tell the client (PyDAP or Xarray) which type of metadata request to send to the server.\n", + "2. 
With PyDAP one can also define the argument `protocol=\"dap4\"` when opening a URL. \n", "\n", - "```{note}\n", - "Q: How do we tell the server which protocol to use?\n", - "A: By replaing the http -> dap4 in the URL\n", - "```\n" + " Below we follow option 1.\n" ] }, { @@ -172,212 +138,44 @@ "outputs": [], "source": [ "%%time\n", - "pyds = open_url(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "31f65365-d160-43e0-b1c3-36312121c91e", - "metadata": {}, - "outputs": [], - "source": [ + "pyds = open_url(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session)\n", "pyds.tree()" ] }, { "cell_type": "code", "execution_count": null, - "id": "7656add1-270c-4ef6-8a67-e4d3096be0e4", + "id": "891e3d5e-b8aa-4003-8b61-6939cbdfeeec", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "d6e34e90-59bc-4d9b-b466-aed912e33d26", + "id": "882690ad-1282-41d1-b0c8-e2610522011a", "metadata": {}, "source": [ - " **XARRAY: PYDAP as Engine**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "449cc5f1-9686-4a7d-a07b-8850697aea76", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "dt = xr.open_datatree(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session, engine='pydap')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5374a546-bebc-4861-86a7-bb0efcfeddb5", - "metadata": {}, - "outputs": [], - "source": [ - "dt" - ] - }, - { - "cell_type": "markdown", - "id": "ed9d34db-51d9-4a6c-b24c-4b8b021b141e", - "metadata": {}, - "source": [ - "\n", - "```{note}\n", - "Q: Why is it slower with Xarray + Pydap?\n", - "A: Xarray eagerly downloads all dimension data into memory before creating the Dataset Object\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "id": "483fb042-0492-4d74-aa81-8ee05b3f254e", - "metadata": {}, - "source": [ - " **NAIVE APPROACHES when accessing OPeNDAP in the Cloud**\n", + 
" Having now the full list of variables, we are interested in:\n", + "* Geographic location data,\n", + "* The variable `weight`.\n", + "* Time.\n", + "* All of the data inside the group `support_data`\n", + "* The variable `o3_below_cloud` for further processing.\n", "\n", - "* When aggregating multiple remote files with Xarray for data exploration\n", - "* When downloading data into a file\n", "\n", - "### Solutions\n", - "* Use pydap's logic to increase performance\n", - "* Construct Constraint Expressions to reduce the metadata that Xarray parses\n", + " The remote file contains hierarchies such as Groups, which act like folder in a filesystem. In DAP4 we can define variables across different hierarchies, by defining their `fully qualifying name`, analogous to a file inside a filesystem. \n", "\n", - "```{note}\n", - "While Xarray has a .drop_variables method to \"drop\" variables before say storing into a file, this dropping takes a place after creating the Xarray Dataset object. In some cases where a single granule has 1000 variables, this approach can be subperformant.\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "id": "cdd6888a-2ea8-45fa-9f0d-26dbb0466c76", - "metadata": {}, - "source": [ - " **How to best Aggregate Multiple Files with Xarray**\n", - "\n", - "\n", - " Below we demonstrate the performance of Xarray when aggregating multiple remote granules with OPeNDAP URLs (over https) with 2 approaches:\n", - "\n", - "* Naive approach.\n", - "* Employing PyDAP's internal methods for \"consolidating metadata\".\n" - ] - }, - { - "cell_type": "markdown", - "id": "20a3f3ba-e39b-47db-8fd7-222848eb07da", - "metadata": {}, - "source": [ - " **Naive Approach**\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "2ca75ee2-b801-4d58-96aa-ad4db88b967a", - "metadata": {}, - "outputs": [], - "source": [ - "# Convert URLS into DAP4 urls\n", - "dap4_urls = [url.replace(\"https\", 'dap4') for url in cmr_urls]\n" - ] - }, - { - 
"cell_type": "code", - "execution_count": null, - "id": "137620e4-1493-4fbd-ab7e-848b35018e9a", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "ds = xr.open_mfdataset(dap4_urls, engine='pydap', session=my_session, parallel=True, concat_dim='time', combine='nested')" + " The list of variables we are interested in are declared in `keep_variables`, below\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "67fd70eb-a3e7-48f2-b246-9af3b15709ad", + "id": "8c723e3b-7300-451a-9372-766a5074fb4a", "metadata": {}, "outputs": [], "source": [ - "ds" - ] - }, - { - "cell_type": "markdown", - "id": "bc72b93e-1bdb-46f5-af78-43e6a541dcff", - "metadata": {}, - "source": [ - " **NOT-SO Naive Approach**\n", - "\n", - "* Consolidating Metadata via PYDAP.\n", - "\n", - " This approach generates a SQLite object on a local directory, storing all metadata for later reuse. THis is, can consolidate all metadata into a single file which pydap can natively use to speed up the Xarray Dataset object generation. 
We demonstrate below.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "76575c50-e7ba-4e9d-88c3-254d7efabb0e", - "metadata": {}, - "outputs": [], - "source": [ - "metadata_name = \"./data/TEMPO_O3TOT_L3_V03\"\n", - "cache_kwargs = {'cache_name': metadata_name}\n", - "new_session = create_session(use_cache=True, session=auth.get_session(), cache_kwargs=cache_kwargs)\n", - "new_session" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "5c6d9ac9-f0d8-4945-8c55-15f810355810", - "metadata": {}, - "outputs": [], - "source": [ - "new_session.cache.clear()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "73881cc6-e483-488b-aa74-3461f0d60651", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "consolidate_metadata(dap4_urls, session=new_session, concat_dim='time')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "1659c098-a1aa-44aa-947b-11d0be240e50", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "ds1 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested')\n", - "ds1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6bc358e6-a612-4d6b-bc99-84faee49d2d5", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "ds2 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/product')\n", - "ds3 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/support_data')\n", - "ds4 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/geolocation')\n", - "ds = xr.merge([ds1, ds2, ds3, ds4])\n", - "ds" + "keep_variables = ['/longitude', '/latitude', '/time', \"/support_data\", \"/product/o3_below_cloud\"]\n" ] }, { @@ -385,33 +183,12 @@ "id": 
"3d19b3c3-8017-4f7c-b5a2-b38a70622672", "metadata": {}, "source": [ - " **Server-Side Subsetting**\n", - "\n", - "* The DAP4 protocol supports server-side subseting by **a)** Variable name, and by **b)** Index space with the use of Constraint Expressions (additional query parameters appended to each URL).\n", - "* Constraint Expressions (CEs) can be used to **a)** speed up metadata dataset object creation, and to download a subset of data (as opposed to downloading entire file and subsetting locally).\n", - "\n", + "### Subset by coordinate values\n", "\n", - " **Below we demonstrate how to make sure the subset is done by the server and not by Xarray once the data has been downloaded**.\n", - " Because xarray loads dimension (coordinates) into memory,for L3 data it is easy to identify a single spatial subset that applies to all granules.\n", + " Below we use Xarray to download coordinate data using PyDAP in the background. THe goal is to identify a region of interest and only download data inside that region of interest.\n", "\n" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "6ca18a29-2928-43cd-a9b0-0778dfb2bc61", - "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "9084be16-040d-40c9-9ca7-9f0ea6998a42", - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "code", "execution_count": null, @@ -420,176 +197,38 @@ "outputs": [], "source": [ "# define bounding box by edges\n", - "lat_min, lat_max = 45, 65\n", + "lat_min, lat_max = 45, 60\n", "lon_min, lon_max = -123, -120.5\n", "\n", - "lat_index = dt.latitude.to_index()\n", - "lon_index = dt.longitude.to_index()\n", + "ds = xr.open_dataset(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session, engine='pydap')\n", + "\n", + "lat_index = ds.latitude.to_index()\n", + "lon_index = ds.longitude.to_index()\n", "\n", "lat_start = lat_index.get_indexer([lat_min], method=\"nearest\")[0]\n", "lat_end = 
lat_index.get_indexer([lat_max], method=\"nearest\")[0]\n", "\n", "lon_start = lon_index.get_indexer([lon_min], method=\"nearest\")[0]\n", - "lon_end = lon_index.get_indexer([lon_max], method=\"nearest\")[0]\n" - ] - }, - { - "cell_type": "markdown", - "id": "f418dd76-58a2-48a3-ad79-a98a319e65a5", - "metadata": {}, - "source": [ - "### Single granule case" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "30c5eeca-1ee5-4840-ac36-8ae394e6cfc2", - "metadata": {}, - "outputs": [], - "source": [ - "ds = xr.open_datatree(dap4_urls[0], engine='pydap', session=my_session)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "73e9d71c-df1f-4ced-86d2-0568ae89a580", - "metadata": {}, - "outputs": [], - "source": [ - "%%time\n", - "ds['support_data/terrain_height'].isel(latitude=slice(lat_start, lat_end), longitude=slice(lon_start, lon_end)).plot()" - ] - }, - { - "cell_type": "markdown", - "id": "d8c0e9c1-8092-46d2-80d2-05f475571fb5", - "metadata": {}, - "source": [ - " **When slicing a Variable with Xarray (for a single granule)** Xarray passes down via pydap the request so that subsetting takes place close to the data. 
\n", - "\n", - "\n", - "* **When creating an aggregated view of the dataset, to ensure the server does the subsetting, the user must pass a chunk argument when creating the dataset.**\n", - "\n", - "For example:\n", - "```python\n", - "ds1 = xr.open_mfdataset(\n", - " dap4_urls, engine='pydap', \n", - " session=new_session, \n", - " parallel=True, \n", - " concat_dim='time', \n", - " combine='nested', \n", - " group='/',\n", - " chunk={'latitude': size_of_lat_slice, 'longitude': size_of_lon_slice}\n", - ")\n", - "\n", - "ds2 = xr.open_mfdataset(\n", - " dap4_urls, engine='pydap', \n", - " session=new_session, \n", - " parallel=True, \n", - " concat_dim='time', \n", - " combine='nested', \n", - " group='/support_data',\n", - " chunk={'latitude': size_of_lat_slice, 'longitude': size_of_lon_slice}\n", - ")\n", + "lon_end = lon_index.get_indexer([lon_max], method=\"nearest\")[0]\n", "\n", - "ds = xr.merge([ds1, ds2])\n", - "ds['support_data/terrain_height'].isel(longitude=slice(lon_start, lon_end), latitude=slice(lat_start,lat_end)).to_netcdf('local_file')\n", - "```\n", - "\n", - " **If you do not chunk when creating the dataset, Xarray will download the ENTIRE variable into memory and then subset it, resulting in subpar performance and unnecessary data transfer (bug on Xarray).**\n", - "\n", - "\n", - "\n", - " For reference, check this observation on the pydap documentation: \n", - "* [Subsetting 2 OPenDAP URLs](https://pydap.github.io/pydap/en/5_minute_tutorial.html#case-2-subsetting-across-two-separate-files), in particular the observation made in this block:\n", - "* [How to pass the slice from Xarray to the remote Server](https://pydap.github.io/pydap/en/5_minute_tutorial.html#how-to-pass-the-slice-from-xarray-to-the-remote-server)" - ] - }, - { - "cell_type": "markdown", - "id": "cfaab203-a8fc-4f4a-860e-c90978d788ff", - "metadata": {}, - "source": [ - " **Subsetting by Variable Names**\n", - "\n", - " This can be useful when the vast majory of variables will be 
discarded. In particular when the granule has O(100-1000) variables (typical of Level2 datasets). In that scenario, simply openning an Xarray object can take ~ 10-100 seconds or more for a single granule, since Xarray needs to parse all the metadata.\n", - "\n", - " A URL can be constructed that tells the OPeNDAP server which variables (and their dimensions) a user is interested. \n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "cfb8d09c-dd9b-4407-a9bb-5c8705c9aa1f", - "metadata": {}, - "outputs": [], - "source": [ - "keep_variables = ['/longitude', '/latitude', '/time', \"/support_data\", \"/product/o3_below_cloud\"]\n", - "\n", - "CE = \"?dap4.ce=\" + \";\".join(keep_variables)\n", - "\n", - "dap4ce_urls = [url+CE for url in dap4_urls ]\n", - "dap4ce_urls[0]" + "# define slicing argument to use in PyDAP to stream the subset of data\n", + "dim_slices = {'latitude': (lat_start, lat_end), 'longitude': (lon_start, lon_end)}\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "44ebb3dc-2a1f-4ce9-a4e6-8021e41c5934", + "id": "cb476edf-611b-4768-888a-3cb9a63703f2", "metadata": {}, "outputs": [], - "source": [ - "%%time\n", - "dt = xr.open_datatree(dap4ce_urls[0], engine='pydap', session=my_session)\n", - "dt" - ] + "source": [] }, { "cell_type": "markdown", - "id": "6d3b2e98-5050-4e46-85e0-f977efc69174", - "metadata": {}, - "source": [ - " **Recommended Worklow for Streaming Data**\n", - "\n", - "\n", - "* **Data Exploration**: `Xarray + PyDAP`. When interested in visualizing one or two variables. Identify regions of interest (interactively).\n", - "\n", - "However, if the goal is to stream data via OPeNDAP, Xarray+opendap requires extra logic to get the **best performance** that is stil subpar compared to **talking to the server directly**\n", - "\n", - "\n", - "* **Streaming a Data Subset**: Pure PyDAP to parallelize direct server (subset) requests. 
See below:\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "7a5fbc55-d3a7-4d66-b94b-8fee84a3645c", - "metadata": {}, - "outputs": [], - "source": [ - "dim_slices = {'latitude': (lat_start, lat_end), 'longitude': (lon_start, lon_end)}\n", - "\n", - "print('Varibles I want to stream into local file: \\n\\n', keep_variables, '\\n\\n ========================================================================') \n", - "\n", - "print(\"\\npassing down to the server the following slice that will be applied to data variables of interest: \\n\\n\", dim_slices,'\\n\\n ========================================================================')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "6879445c-cae7-4b7b-a729-f5f7e4e8c805", + "id": "400eab9f-71d9-413d-a1b3-e10bee2f8d5f", "metadata": {}, - "outputs": [], "source": [ - "# This is necessary to pass down credentials since requests.Session is not a pickable object\n", - "session_state = session_state = extract_session_state(my_session)" + "### Stream data " ] }, { @@ -600,14 +239,7 @@ "outputs": [], "source": [ "%%time\n", - "stream_parallel(\n", - " cmr_urls[:4],\n", - " session_state, \n", - " keep_variables=keep_variables, \n", - " dim_slices=dim_slices, \n", - " output_path=\"./data/\", \n", - " max_workers=4,\n", - ")" + "dap_to_netcdf(cmr_urls[:30], session=my_session, output_path = \"./data\", dim_slices=dim_slices, keep_variables=keep_variables)" ] }, { @@ -626,9 +258,9 @@ "outputs": [], "source": [ "%%time\n", - "ds1 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/\", parallel=True)\n", - "ds2 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/support_data\", parallel=True, concat_dim='time', combine='nested')\n", - "ds3 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/product\", parallel=True, concat_dim='time', combine='nested')\n", + "ds1 = xr.open_mfdataset(\"data/TEMPO_O3*.nc4\", group=\"/\", parallel=True)\n", + "ds2 = 
xr.open_mfdataset(\"data/TEMPO_O3*.nc4\", group=\"/support_data\", parallel=True, concat_dim='time', combine='nested')\n", + "ds3 = xr.open_mfdataset(\"data/TEMPO_O3*.nc4\", group=\"/product\", parallel=True, concat_dim='time', combine='nested')\n", "ds = xr.merge([ds1, ds2, ds3])" ] }, @@ -642,21 +274,15 @@ "ds" ] }, - { - "cell_type": "code", - "execution_count": null, - "id": "996af357-ed03-4aa3-bb0f-4c6c05f6132b", - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "code", "execution_count": null, "id": "1204acec-86ab-477d-b4fb-4ac33a228d1a", "metadata": {}, "outputs": [], - "source": [] + "source": [ + "ds['terrain_height'].isel(time=0).plot();" + ] } ], "metadata": { diff --git a/binder/ECCO.ipynb b/binder/Xarray/ECCO.ipynb similarity index 100% rename from binder/ECCO.ipynb rename to binder/Xarray/ECCO.ipynb diff --git a/binder/Iceberg_drift.ipynb b/binder/Xarray/Iceberg_drift.ipynb similarity index 99% rename from binder/Iceberg_drift.ipynb rename to binder/Xarray/Iceberg_drift.ipynb index 2cb88d9..e491d0d 100644 --- a/binder/Iceberg_drift.ipynb +++ b/binder/Xarray/Iceberg_drift.ipynb @@ -511,7 +511,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.11" + "version": "3.12.12" } }, "nbformat": 4, diff --git a/binder/MERRA-2_Access.ipynb b/binder/Xarray/MERRA-2_Access.ipynb similarity index 100% rename from binder/MERRA-2_Access.ipynb rename to binder/Xarray/MERRA-2_Access.ipynb diff --git a/binder/OSCAR.ipynb b/binder/Xarray/OSCAR.ipynb similarity index 100% rename from binder/OSCAR.ipynb rename to binder/Xarray/OSCAR.ipynb diff --git a/binder/Xarray/TEMPO_tutorials.ipynb b/binder/Xarray/TEMPO_tutorials.ipynb new file mode 100644 index 0000000..01774f4 --- /dev/null +++ b/binder/Xarray/TEMPO_tutorials.ipynb @@ -0,0 +1,681 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dd1d5518-f9a5-4a7f-a5fc-16d6a7394ee4", + "metadata": {}, + "source": [ + " **ACCESS to TEMPO data via 
OPeNDAP**\n", + "\n", + " **Requirements**\n", + "1. EDL authentication (username/password)\n", + "2. **Optional**: If running the notebook locally, create and activate the conda environment defined in the `environment.yml` file.\n", + "\n", + "\n", + " **Objectives**\n", + "### Basics\n", + "- Brief Introduction to **OPeNDAP** (i.e. **dap2** vs **dap4**). \n", + "- How to find **OPeNDAP** URLs.\n", + "- Inspecting metadata (differences between **browser** / **pydap** and **Xarray**).\n", + "- **Naive approach**: access data from a url using **Xarray** + **pydap**. Here we demonstrate different ways to authenticate.\n", + "\n", + "### Subset a remote file\n", + "\n", + "- **a)** By Variables\n", + "- **b)** By Spatial selection\n", + "\n", + "### Subset multiple remote files\n", + "\n", + "- **a)** Naive approaches.\n", + "- **b)** Streaming data\n", + "\n", + "## Appendix: Using curl\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "740c000d-5314-4aaf-81a5-d70d1e79f9fe", + "metadata": {}, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import datetime as dt\n", + "import earthaccess\n", + "\n", + "# import pydap-specific tools\n", + "from pydap.net import create_session, extract_session_state\n", + "from pydap.client import get_cmr_urls, consolidate_metadata, stream, stream_parallel, open_url\n" + ] + }, + { + "cell_type": "markdown", + "id": "f358904c-f969-43c4-93ce-4d137c0492c8", + "metadata": {}, + "source": [ + "# Finding OPeNDAP URLs\n", + "\n", + " **Query opendap urls using NASA's CMR API**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c7bdc702-f426-4b0a-a49f-2649ecdc7c18", + "metadata": {}, + "outputs": [], + "source": [ + "TEMPO_O3TOT_L3_V03_ccid = \"C2930764281-LARC_CLOUD\"\n", + "time_range = [dt.datetime(2025, 9, 1), dt.datetime(2025, 9, 30)] # One month of data\n", + "\n", + "cmr_urls = get_cmr_urls(ccid=TEMPO_O3TOT_L3_V03_ccid, time_range=time_range, limit=1000) # you 
can increase the limit of results\n", + "\n", + "\n", + "print(\"################################################ \\n We found a total of \", len(cmr_urls), \"OPeNDAP URLS!!!\\n################################################\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9ea1330-8390-4ae2-b580-45f039bf9aa6", + "metadata": {}, + "outputs": [], + "source": [ + "cmr_urls[0]" + ] + }, + { + "cell_type": "markdown", + "id": "8d9882ef-e8cd-49f8-b627-026047124459", + "metadata": {}, + "source": [ + "# Inspecting Metadata\n", + "\n", + " **Understanding DAP4**\n", + "\n", + "
\n", + "\"drawing\" \n", + "
\n", + "\n", + "\n", + "### Basic Data Exploration\n", + "\n", + "To inspect the metadata of a remote OPeNDAP URL (variables in the file), you can:\n", + "\n", + "* Append `.dmr` to the end of the URL.\n", + "* Open it in a browser.\n", + "* This is useful, for example, when subsetting by variable.\n", + "\n", + "For example:\n", + "\n", + "https://opendap.earthdata.nasa.gov/collections/C2930764281-LARC_CLOUD/granules/TEMPO_O3TOT_L3_V03_20250831T232841Z_S016.nc.dmr\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "6e04deee-6c59-4437-9aed-381a73085496", + "metadata": {}, + "source": [ + " **EDL Authentication via OPeNDAP**\n", + "\n", + " You can authenticate via:\n", + "\n", + "* `.netrc` file (username and password)\n", + "* Bearer token header\n", + "\n", + "\n", + " OPeNDAP's Hyrax server supports both forms of authentication. Below we use earthaccess to store EDL credentials and pass them to a session that will be used to stream data from OPeNDAP in the Cloud.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2725988f-618e-4ed4-879c-3adc0c4771e4", + "metadata": {}, + "outputs": [], + "source": [ + "auth = earthaccess.login(strategy=\"interactive\", persist=True) # you will be prompted to add your EDL credentials\n", + "\n", + "# pass Token Authorization to a new Session.\n", + "my_session = create_session(session=auth.get_session())" + ] + }, + { + "cell_type": "markdown", + "id": "a2322093-b9b6-49c5-8bb4-7ad17bffb95d", + "metadata": {}, + "source": [ + "# Accessing Metadata\n", + "\n", + " Which tools are available, how they differ, and what each one does.\n", + "\n", + " **PYDAP: Metadata-Only**\n", + "\n", + "\n", + "\n", + "```{note}\n", + "Q: How do we tell the server which protocol to use?\n", + "A: By replacing the `https` scheme with `dap4` in the URL\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3179f1c6-8661-4f46-9f80-5c75d411af4c", + "metadata": {}, + "outputs": [], + 
"source": [ + "%%time\n", + "pyds = open_url(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31f65365-d160-43e0-b1c3-36312121c91e", + "metadata": {}, + "outputs": [], + "source": [ + "pyds.tree()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7656add1-270c-4ef6-8a67-e4d3096be0e4", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "d6e34e90-59bc-4d9b-b466-aed912e33d26", + "metadata": {}, + "source": [ + " **XARRAY: PYDAP as Engine**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "449cc5f1-9686-4a7d-a07b-8850697aea76", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "dt = xr.open_datatree(cmr_urls[0].replace(\"https\",\"dap4\"), session=my_session, engine='pydap')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5374a546-bebc-4861-86a7-bb0efcfeddb5", + "metadata": {}, + "outputs": [], + "source": [ + "dt" + ] + }, + { + "cell_type": "markdown", + "id": "ed9d34db-51d9-4a6c-b24c-4b8b021b141e", + "metadata": {}, + "source": [ + "\n", + "```{note}\n", + "Q: Why is it slower with Xarray + Pydap?\n", + "A: Xarray eagerly downloads all dimension data into memory before creating the Dataset Object\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "483fb042-0492-4d74-aa81-8ee05b3f254e", + "metadata": {}, + "source": [ + " **NAIVE APPROACHES when accessing OPeNDAP in the Cloud**\n", + "\n", + "* When aggregating multiple remote files with Xarray for data exploration\n", + "* When downloading data into a file\n", + "\n", + "### Solutions\n", + "* Use pydap's logic to increase performance\n", + "* Construct Constraint Expressions to reduce the metadata that Xarray parses\n", + "\n", + "```{note}\n", + "While Xarray has a .drop_variables method to \"drop\" variables before say storing into a file, this dropping takes a place after creating the 
Xarray Dataset object. In some cases where a single granule has 1000 variables, this approach can be slow.\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "cdd6888a-2ea8-45fa-9f0d-26dbb0466c76", + "metadata": {}, + "source": [ + " **How to best Aggregate Multiple Files with Xarray**\n", + "\n", + "\n", + " Below we demonstrate the performance of Xarray when aggregating multiple remote granules with OPeNDAP URLs (over https) with 2 approaches:\n", + "\n", + "* Naive approach.\n", + "* Employing PyDAP's internal methods for \"consolidating metadata\".\n" + ] + }, + { + "cell_type": "markdown", + "id": "20a3f3ba-e39b-47db-8fd7-222848eb07da", + "metadata": {}, + "source": [ + " **Naive Approach**\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2ca75ee2-b801-4d58-96aa-ad4db88b967a", + "metadata": {}, + "outputs": [], + "source": [ + "# Convert URLS into DAP4 urls\n", + "dap4_urls = [url.replace(\"https\", 'dap4') for url in cmr_urls]\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "137620e4-1493-4fbd-ab7e-848b35018e9a", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds = xr.open_mfdataset(dap4_urls, engine='pydap', session=my_session, parallel=True, concat_dim='time', combine='nested')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "67fd70eb-a3e7-48f2-b246-9af3b15709ad", + "metadata": {}, + "outputs": [], + "source": [ + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "bc72b93e-1bdb-46f5-af78-43e6a541dcff", + "metadata": {}, + "source": [ + " **NOT-SO Naive Approach**\n", + "\n", + "* Consolidating Metadata via PYDAP.\n", + "\n", + " This approach generates a SQLite database in a local directory, storing all metadata for later reuse. That is, pydap can consolidate all metadata into a single file, which it can then use natively to speed up the generation of the Xarray Dataset object. 
We demonstrate below.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76575c50-e7ba-4e9d-88c3-254d7efabb0e", + "metadata": {}, + "outputs": [], + "source": [ + "metadata_name = \"./data/TEMPO_O3TOT_L3_V03\"\n", + "cache_kwargs = {'cache_name': metadata_name}\n", + "new_session = create_session(use_cache=True, session=auth.get_session(), cache_kwargs=cache_kwargs)\n", + "new_session" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c6d9ac9-f0d8-4945-8c55-15f810355810", + "metadata": {}, + "outputs": [], + "source": [ + "new_session.cache.clear()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73881cc6-e483-488b-aa74-3461f0d60651", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "consolidate_metadata(dap4_urls, session=new_session, concat_dim='time')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1659c098-a1aa-44aa-947b-11d0be240e50", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds1 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested')\n", + "ds1" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bc358e6-a612-4d6b-bc99-84faee49d2d5", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds2 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/product')\n", + "ds3 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/support_data')\n", + "ds4 = xr.open_mfdataset(dap4_urls, engine='pydap', session=new_session, parallel=True, concat_dim='time', combine='nested', group='/geolocation')\n", + "ds = xr.merge([ds1, ds2, ds3, ds4])\n", + "ds" + ] + }, + { + "cell_type": "markdown", + "id": "3d19b3c3-8017-4f7c-b5a2-b38a70622672", + "metadata": {}, + "source": [ + " **Server-Side 
Subsetting**\n", + "\n", + "* The DAP4 protocol supports server-side subsetting by **a)** variable name, and by **b)** index space, with the use of Constraint Expressions (additional query parameters appended to each URL).\n", + "* Constraint Expressions (CEs) can be used to **a)** speed up creation of the dataset object from metadata, and **b)** download a subset of the data (as opposed to downloading the entire file and subsetting locally).\n", + "\n", + "\n", + " **Below we demonstrate how to make sure the subset is done by the server and not by Xarray once the data has been downloaded**.\n", + " Because Xarray loads dimensions (coordinates) into memory, for L3 data it is easy to identify a single spatial subset that applies to all granules.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ca18a29-2928-43cd-a9b0-0778dfb2bc61", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9084be16-040d-40c9-9ca7-9f0ea6998a42", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "602f104d-25cf-4dfb-ac9f-6cf40fc5644f", + "metadata": {}, + "outputs": [], + "source": [ + "# define bounding box by edges\n", + "lat_min, lat_max = 45, 65\n", + "lon_min, lon_max = -123, -120.5\n", + "\n", + "lat_index = dt.latitude.to_index()\n", + "lon_index = dt.longitude.to_index()\n", + "\n", + "lat_start = lat_index.get_indexer([lat_min], method=\"nearest\")[0]\n", + "lat_end = lat_index.get_indexer([lat_max], method=\"nearest\")[0]\n", + "\n", + "lon_start = lon_index.get_indexer([lon_min], method=\"nearest\")[0]\n", + "lon_end = lon_index.get_indexer([lon_max], method=\"nearest\")[0]\n" + ] + }, + { + "cell_type": "markdown", + "id": "f418dd76-58a2-48a3-ad79-a98a319e65a5", + "metadata": {}, + "source": [ + "### Single granule case" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "30c5eeca-1ee5-4840-ac36-8ae394e6cfc2", + "metadata": 
{}, + "outputs": [], + "source": [ + "ds = xr.open_datatree(dap4_urls[0], engine='pydap', session=my_session)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73e9d71c-df1f-4ced-86d2-0568ae89a580", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds['support_data/terrain_height'].isel(latitude=slice(lat_start, lat_end), longitude=slice(lon_start, lon_end)).plot()" + ] + }, + { + "cell_type": "markdown", + "id": "d8c0e9c1-8092-46d2-80d2-05f475571fb5", + "metadata": {}, + "source": [ + " **When slicing a variable with Xarray (for a single granule)**, Xarray passes the request down through pydap so that subsetting takes place close to the data. \n", + "\n", + "\n", + "* **When creating an aggregated view of the dataset, to ensure the server does the subsetting, the user must pass a `chunks` argument when creating the dataset.**\n", + "\n", + "For example:\n", + "```python\n", + "ds1 = xr.open_mfdataset(\n", + " dap4_urls, engine='pydap', \n", + " session=new_session, \n", + " parallel=True, \n", + " concat_dim='time', \n", + " combine='nested', \n", + " group='/',\n", + " chunks={'latitude': size_of_lat_slice, 'longitude': size_of_lon_slice}\n", + ")\n", + "\n", + "ds2 = xr.open_mfdataset(\n", + " dap4_urls, engine='pydap', \n", + " session=new_session, \n", + " parallel=True, \n", + " concat_dim='time', \n", + " combine='nested', \n", + " group='/support_data',\n", + " chunks={'latitude': size_of_lat_slice, 'longitude': size_of_lon_slice}\n", + ")\n", + "\n", + "ds = xr.merge([ds1, ds2])\n", + "ds['terrain_height'].isel(longitude=slice(lon_start, lon_end), latitude=slice(lat_start,lat_end)).to_netcdf('local_file')\n", + "```\n", + "\n", + " **If you do not chunk when creating the dataset, Xarray will download the ENTIRE variable into memory and then subset it, resulting in subpar performance and unnecessary data transfer (a bug in Xarray).**\n", + "\n", + "\n", + "\n", + " For reference, check this observation on the pydap 
documentation: \n", + "* [Subsetting 2 OPeNDAP URLs](https://pydap.github.io/pydap/en/5_minute_tutorial.html#case-2-subsetting-across-two-separate-files), in particular the observation made in this block:\n", + "* [How to pass the slice from Xarray to the remote Server](https://pydap.github.io/pydap/en/5_minute_tutorial.html#how-to-pass-the-slice-from-xarray-to-the-remote-server)" + ] + }, + { + "cell_type": "markdown", + "id": "cfaab203-a8fc-4f4a-860e-c90978d788ff", + "metadata": {}, + "source": [ + " **Subsetting by Variable Names**\n", + "\n", + " This can be useful when the vast majority of variables will be discarded, especially when the granule has O(100-1000) variables (typical of Level 2 datasets). In that scenario, simply opening an Xarray object can take ~ 10-100 seconds or more for a single granule, since Xarray needs to parse all the metadata.\n", + "\n", + " A URL can be constructed that tells the OPeNDAP server which variables (and their dimensions) a user is interested in. \n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfb8d09c-dd9b-4407-a9bb-5c8705c9aa1f", + "metadata": {}, + "outputs": [], + "source": [ + "keep_variables = ['/longitude', '/latitude', '/time', \"/support_data\", \"/product/o3_below_cloud\"]\n", + "\n", + "CE = \"?dap4.ce=\" + \";\".join(keep_variables)\n", + "\n", + "dap4ce_urls = [url + CE for url in dap4_urls]\n", + "dap4ce_urls[0]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "44ebb3dc-2a1f-4ce9-a4e6-8021e41c5934", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "dt = xr.open_datatree(dap4ce_urls[0], engine='pydap', session=my_session)\n", + "dt" + ] + }, + { + "cell_type": "markdown", + "id": "6d3b2e98-5050-4e46-85e0-f977efc69174", + "metadata": {}, + "source": [ + " **Recommended Workflow for Streaming Data**\n", + "\n", + "\n", + "* **Data Exploration**: `Xarray + PyDAP`. When interested in visualizing one or two variables. 
Identify regions of interest (interactively).\n", + "\n", + "However, if the goal is to stream data via OPeNDAP, Xarray + PyDAP requires extra logic to reach its **best performance**, which is still subpar compared to **talking to the server directly**.\n", + "\n", + "\n", + "* **Streaming a Data Subset**: Pure PyDAP to parallelize direct server (subset) requests. See below:\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7a5fbc55-d3a7-4d66-b94b-8fee84a3645c", + "metadata": {}, + "outputs": [], + "source": [ + "dim_slices = {'latitude': (lat_start, lat_end), 'longitude': (lon_start, lon_end)}\n", + "\n", + "print('Variables I want to stream into local file: \\n\\n', keep_variables, '\\n\\n ========================================================================') \n", + "\n", + "print(\"\\npassing down to the server the following slice that will be applied to data variables of interest: \\n\\n\", dim_slices,'\\n\\n ========================================================================')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6879445c-cae7-4b7b-a729-f5f7e4e8c805", + "metadata": {}, + "outputs": [], + "source": [ + "# This is necessary to pass down credentials since requests.Session is not a picklable object\n", + "session_state = extract_session_state(my_session)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9098b2c2-463f-4f04-a915-4c01a7002799", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "stream_parallel(\n", + " cmr_urls[:4],\n", + " session_state,\n", + " keep_variables=keep_variables, \n", + " dim_slices=dim_slices,\n", + " output_path=\"./data/\",\n", + " max_workers=4,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "0da35e08-efd4-4186-9337-b370e27c3712", + "metadata": {}, + "source": [ + "## check the data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5c7d604e-e963-4f86-bd38-e17d732d8256", 
+ "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "ds1 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/\", parallel=True)\n", + "ds2 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/support_data\", parallel=True, concat_dim='time', combine='nested')\n", + "ds3 = xr.open_mfdataset(\"TEMPO_O3*.nc\", group=\"/product\", parallel=True, concat_dim='time', combine='nested')\n", + "ds = xr.merge([ds1, ds2, ds3])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "dc6dbd48-93f6-45cb-9356-13dff00d59bd", + "metadata": {}, + "outputs": [], + "source": [ + "ds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "996af357-ed03-4aa3-bb0f-4c6c05f6132b", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1204acec-86ab-477d-b4fb-4ac33a228d1a", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/binder/earthaccess.ipynb b/binder/Xarray/earthaccess.ipynb similarity index 100% rename from binder/earthaccess.ipynb rename to binder/Xarray/earthaccess.ipynb diff --git a/binder/on-premOPeNDAP.ipynb b/binder/Xarray/on-premOPeNDAP.ipynb similarity index 100% rename from binder/on-premOPeNDAP.ipynb rename to binder/Xarray/on-premOPeNDAP.ipynb diff --git a/binder/environment.yml b/binder/environment.yml index 8833df5..704f045 100644 --- a/binder/environment.yml +++ b/binder/environment.yml @@ -1,4 +1,4 @@ -name: Earthdata2025 +name: Earthdata2026 channels: - conda-forge dependencies: @@ -15,9 +15,9 @@ dependencies: - pandas - scipy - 
xoak +- xarray = 2024.6.0 +- pydap = 3.5.9 - pip: - - git+https://github.com/pydap/pydap.git - - git+https://github.com/pydata/xarray.git - jupyter-contrib-nbextensions - ipywidgets - widgetsnbextension diff --git a/binder/GetStarted.ipynb b/binder/retired/GetStarted.ipynb similarity index 99% rename from binder/GetStarted.ipynb rename to binder/retired/GetStarted.ipynb index 4af9db7..59a18b1 100644 --- a/binder/GetStarted.ipynb +++ b/binder/retired/GetStarted.ipynb @@ -214,7 +214,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.11" + "version": "3.12.12" } }, "nbformat": 4,