Data Services Standing Committee Meeting Report

March 2021

Open Action Items

[F2018:2] Develop a community letter supporting archiving historical data at the National Archives

Responsible: Dave Wilson

Status: (March 2019) Wilson spoke with the USGS Records Disposition Coordinator regarding a community letter in support of NARA accepting the WWSSN film chips.  The Coordinator said he did not think a letter was necessary at this time, but that he may be willing to consider one if he hits a wall.

(Oct 2019) Hit a brick wall.  Participants at the Historic film chip workshop worried that NARA would not release the film chips from its archive.  The film chips are available at ASL in air-conditioned containers.  There was some concern about being able to sustain the cost of climate control, but ASL has found a way to contain costs based on where they are housing the film chips.

(March 2020) SSA considering the issue through a SIG.

(March 2021) In an email to Lorraine Hwang, Jerry Carter indicated that IRIS supports the idea of rescuing data and would provide support where appropriate.

 

[S2020:1] Investigate inclusion of larger models in EMC in coordination with UNAVCO and report to DSSC. 

Responsible: Trabant

Status: (October 2020) An exploration of handling large models has led to two potential changes: 1) allow models in the netCDF 4 formats, which include data compression, and 2) adopt the THREDDS Data Server (TDS) for managing the netCDF repository.  Change #1 has been completed: models in netCDF 4 are now allowed, which solved the immediate issue.  Change #2, adopting TDS, is still being explored with a future cloud-friendly requirement in mind.  UNAVCO was consulted on their potential use of, or need for, similar data management, and there was none.

(March 2021) Evaluation of TDS was delayed while waiting for a release that supports object storage (and is therefore usable in a future cloud platform).  Evaluation resumed once a new beta release became available.  Initial testing is positive and evaluation will continue.

 

[F2020:1] Enhance the ability of authors to cite data by 1) improving instructions to authors on webpages, 2) promotion through IRIS newsletter, 3) informing journals of the citation services provided, 4) investigating the use of tags on data distributed to users.

Responsible: Carter

Status: (March 2021) Instructions to authors on IRIS and FDSN web pages have been updated and an article on citation was included in the winter newsletter. Items 3 and 4 have not yet been addressed.

 

[F2020:2] Develop a DRAFT Data Licensing policy in coordination with major funding agencies and UNAVCO; consider legal advice.

Responsible: Carter

Status: (March 2021) The Joint Data Services committee of IRIS and UNAVCO met in January 2021 to discuss data licensing. It was recommended that no restrictions be imposed on the organization regarding the data that might be accepted. Carter has reached out to FDSN to discuss the issue with the executive committee.

 

[F2020:3] Develop a DAS data directive that provides a consistent approach to requests for storing DAS data in the data repositories.

Responsible: Carter

Status: (March 2021) IRIS has submitted an MSRI design proposal to address DAS data storage.

 

[F2020:4] Investigate large data management practices employed by the Vera C. Rubin Observatory. 

Responsible:  Carter

Status: (March 2021) Suzan provided information to DS.  DS is following up and should be made the responsible party.

 

[S2021:1] Investigate a way of finding data that “looks different”.  What are the most common reasons for which data are tossed? Link to new “funny squiggles” paper (Ringler et al. 2021 preprint).

Responsible: QAAC

 

Brief Meeting Summary 

After opening remarks and introductions of new members (Marine Denolle, Jonathan MacCarthy, and Board Liaison Sarah Kruse), the minutes from the Fall 2020 DSSC meeting, action item status, DSSC charge recommendations, QAAC charge revisions, and DSSC Policies were approved.  Small changes to the DSSC charge were recommended for Board approval, and the changes to the QAAC charge recommended by the QAAC were accepted.

Status presentations were made by the Data Services (DS) staff and the major data collection centers (DCCs). The Director praised the close working relationship with UNAVCO Data Services, announced the hiring of a software engineer (Mike Stults), mentioned the 100% uptime metric over the past 6 quarters, announced the completion of the Concept of Operations for the CCP project, and announced the attainment of CoreTrustSeal certification. Data ingestion was shown to have declined over the last three quarters, which was attributed to the paucity of temporary experiments being deployed and to delayed station maintenance as a result of the pandemic. Nearly one PiB of data was distributed in 2020, and this is expected to be surpassed in 2021. It was noted that cooperative activities with members of the international community were very active and continue to advance an international effort to improve standards for data formats and exchange as well as standards for data center federation.  David Mencin, the Director of Data Services for UNAVCO, described the challenges faced by IRIS and UNAVCO in accommodating diverse data types and their staging in the CCP, the security issues being faced, and the cooperation between the staff on issues like new data containers.

The next few status presentations were from the Albuquerque Seismic Lab (ASL), the IDA team at UCSD, and the PASSCAL DCC. Adam Ringler reported that ASL continues to upgrade GSN sensors, but this work has been slowed by the pandemic.  ASL also began installing magnetometers with seismic stations, which can be used to help remove long-period seismic noise and improve geomagnetic spatial resolution.  A small study was begun to investigate improving ray coverage through improved station location.  Finding and characterizing digitizer timing errors is made easier using the “gsn_timing” metric, which is now a contributed MUSTANG metric. ASL is also comparing seismic observations of Earth tides to geodetic observations for indications of seismic fidelity.  Rob Mellors from the IDA DCC reported that station visits have been impacted by the pandemic, but that some remote collaborations with local operators are working well.  The primary stream for station data is now to AWS with a secondary stream directly to UCSD, and this new configuration is working well.  IDA is placing a renewed emphasis on cybersecurity. The potential timing issue for Q330s was described, but it is not expected to cause any problems.  Finally, an MOU has been signed for a new site in Uzbekistan.  Bruce Beaudoin from the PASSCAL Instrument Center reported 34 new experiments in 2020 (down ~50%) with 60 carrying over. Some experiments will have delayed starts, and some have had to delay equipment returns.  PASSCAL launched its MT program and is working with DS on data management issues.  There is a hope that ~2,000 new nodes will be available for community use by mid-summer. PASSCAL migrated its software suite to Python 3 and has a conda channel for the PASSCAL Software Suite.

The Deputy Director of Operations, Rick Benson, reported on DS’s progress in merging its data statistics and on efforts to identify a cloud provider. The Deputy Director of Quality Assurance, Gillian Sharer, reported on the activities of her group.  Three new metrics are now available to users, including gsn_timing, which is an ASL-contributed metric. There is an issue with PSDs in that the day boundaries are not spanned; the QA team is working on a solution. New quality tools and reports have either been released or are expected to be released; these include the Quality Assurance Report Generator (QuARG) and PIQQA to help PIs quickly assess data quality at experiment end. Work continues to modernize the Nominal Response Library. Vedran Lekic reported on the QAAC activities, including trying to engage the National Strong Motion Project for input on metrics and considering what tools/interfaces an end-user (e.g. grad student) needs to assess quality (in contrast to network operators, for example). “Data filters” and “research-ready datasets (rrds)” were mentioned as examples.  Ved also emphasized the diverse set of expertise (researchers, network operators, many others) on the QAAC and encouraged engagement of the QAAC with QA questions/tasks/challenges.

A Cyberinfrastructure report was given by the Deputy Director, Rob Casey, who informed the committee about a prototype web tool to find DOIs for networks; a web service for Marsquakes (soon to be released); staff engagement in CCP design; and improvements in Fed Catalog web service performance for large POST requests. The Deputy Director of Architecture and Products, Chad Trabant, reported on several projects involving international cooperation, including completion of StationXML documentation; a draft miniSEED3 specification; and a next generation of the SeedLink protocol (v4) that accommodates the new miniSEED and supports authentication and encryption. Product progress was marked by releasing an aftershock product code to GitHub; working on Mars synthetic waveforms, available via the Syngine service, that will be connected to the Marsquake service; and evaluating the THREDDS Data Server for serving models in the EMC.

A report was given by IRIS President Bob Woodward on the status of the merger and on the run-up to the new facility solicitation. This was followed by a description of new proposals that have (or might have) an impact on Data Services. These include an MSRI design pre-proposal for a DAS data repository, a supplemental proposal for the CCP, an MSRI pre-proposal for the SZ4D effort, and a Rupture Zone Fault Observatory proposal.

Chad Trabant gave a presentation on the progress of the CCP project and the related GeoHDF project. In the discussion that followed, it was made clear that GeoHDF is a model and that the limitations of HDF5 are known.  Jerry Carter then briefed the committee on the progress that IRIS and UNAVCO are making on fulfilling NSF's request for data usage statistics.

The DSSC discussed DMC priorities from various perspectives (DS, the community, and the extended seismo-geodetic community). The existing priorities are (a) infrastructure; (b) core services (ingestion/archive/QA/discovery); (c) support for standards/products/developing world/training.  It was recognized that the geodetic community relies on products and that some products should be core services. The need for community training on computation close to data, data access, and AI applications was also recognized as the DS systems are moved into a cloud environment.

The committee considered ways to improve two-way communication between itself and the community. Several suggestions were made to use existing Slack workspaces and discussion communities for this purpose.

The budget for Year 4 of SAGE-II was presented and endorsed unanimously by the committee.