Standing Committee Meeting Report Data Services
Open Action Items
[F2020:1-2] Enhance the ability of authors to cite data by 1) improving instructions to authors on webpages, 2) promotion through IRIS newsletter, 3) informing journals of the citation services provided, 4) investigating the use of tags on data distributed to users. Develop a DRAFT Data Licensing policy in coordination with major funding agencies and UNAVCO; consider legal advice.
Status: (March 2021) Instructions to authors on IRIS and FDSN web pages have been updated and an article on citation was included in the winter newsletter. Items 3 and 4 have not yet been addressed. The Joint Data Services committee of IRIS and UNAVCO met in January 2021 to discuss data licensing. It was recommended that no restrictions be imposed on the organization regarding the data that might be accepted. Carter has reached out to FDSN to discuss the issue with the executive committee.
(October 2021) The UNAVCO DS governance committee is recommending that a workshop proposal be prepared to address these issues and is inviting IRIS to be involved in this workshop. This item is expected to be linked to the citation workshop. In addition, the FDSN has been approached about preparing a policy statement on licensing, with a recommendation that all metadata be in the public domain and all data be either in the public domain or minimally licensed to require attribution.
(March 2022) The UNAVCO DS governance committee chairpersons (Julie and Suzan) have started to organize a workshop on Data Citation/Licensing.
[F2020:3] Develop a DAS data directive that provides a consistent approach to requests for storing DAS data in the data repositories.
Status: (March 2021) IRIS has submitted an MSRI design proposal to address DAS data storage.
(October 2021) The MSRI proposal was not asked to continue to the second round. More community involvement is needed and this will be sought through a community workshop.
(March 2022) A draft proposal for a community workshop has been introduced to the DAS RCN working group on Data Management. Members of the organizing committee are being sought. It should be noted that IRIS and UNAVCO are continuing to work on better metadata/format that would be appropriate for accepting DAS data. This does not, however, solve the data volume issues.
[S2021:1] Investigate a way of finding data that “looks different”. What are the most common reasons for which data are tossed? Link to new “funny squiggles” paper (Ringler et al. 2021 preprint).
Status: (October 2021) The QAAC is considering this at their next meeting (winter 2021).
(March 2022) No progress. The QAAC is still planning to discuss.
[S2022:1] Wordsmith a list of principles for the data services governance committee and share it with Julie E., David M., Jerry.
Responsible: Jonathan A.
[S2022:2] Begin to coordinate dates/times for a regular joint governance meeting beginning in May.
Hybrid Meetings on 8 and 9 March 2022
Present in person: Heather Ford, Ebru Bozdag, Jonathan MacCarthy, Suzan van der Lee, Sarah Kruse, Rob Casey, Chad Trabant, Rob Mellors, Bruce Beaudoin, Dan Auerbach
On zoom: Jerry Carter, Gillian Sharer, Rebecca Rodd, Adam Ringler, Jonathan Ajo-Franklin (joined after approval of minutes), Julie Elliott, David Mencin.
Approval of Fall 2021 Minutes:
The minutes from the Fall 2021 DSSC meeting were approved.
Policy Review – (Carter)
No policy changes are recommended at this point, particularly since policy revisions are expected as a result of the merger with UNAVCO.
Director’s Report – (Carter)
The Director gave a summary of the operational status of the Directorate:
- The DMC streak of 100% uptime for 10 quarters was broken in January, when the uptime dropped to 99.98%.
- Prototyping for the CCP is progressing on AWS. A demonstration system is expected by end-of-year.
- An interim identity management system is being developed in parallel with UNAVCO and will be implemented in June.
- Federated data requests will require User/Identity standards across data centers, and DS is working with international partners to come together on standards.
- Interim user/usage statistics are being developed using systems to capture what data are being used and by whom based on IP addresses.
- The DMC continues to improve system security.
- The use of existing hardware is being extended through maintenance to avoid new acquisitions prior to CCP cloud migration.
- The Quality Assurance section released ISPAQ v3.
- Earth Model Collaboration now has 140 models - a great success.
- miniSEED3 and SeedLink proposals are slowly working their way through FDSN.
- Searches for a Cloud Administrator / Operations Engineer, a Cloud Data Engineer, and a Deputy Director of Infrastructure are ongoing.
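For scale, the 99.98% uptime figure in the Director's report can be converted into minutes of downtime. The minutes do not state whether the figure applies to the month of January or to the whole quarter, so the sketch below simply evaluates both; the function name and the 31-day and 90-day period lengths are assumptions made for illustration.

```python
def downtime_minutes(uptime_percent, period_days):
    """Minutes of downtime implied by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# Assumed reporting periods (the minutes do not say which one applies).
for label, days in [("31-day month", 31), ("90-day quarter", 90)]:
    print(f"{label}: {downtime_minutes(99.98, days):.1f} min")
```

Under these assumptions, 99.98% corresponds to roughly 9 minutes of downtime over a month or roughly 26 minutes over a quarter.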
Annual Program and Budget, SAGE-II Year 5 – (Carter)
The Director explained the SAGE-II Year 5 annual plan and budget, which is slightly lower than the original NGEOS proposed budget. Less money is being spent on equipment in anticipation of our migration to the cloud. On the second day of the meeting, the committee endorsed the budget.
Section Status Reports: Operations – (Sharer/Trabant)
- Director noted the great efforts to step in to fill the role left by the passing of Rick Benson.
- Usage/Statistics & reporting project
- Online in August 2021
- Python-based reporting system (“Shipments”) to allow for monitoring/usage. A dashboard allows for on-demand monitoring.
- Temporary network data are ~¼ of shipments. “Other networks” is almost another ¼. This is a lumping of all networks representing < 0.8% of requests.
- FDSN dataselect is the most popular request interface by far.
- Storage/shipments for US-funded data (archive is 70% US-funded data). Historical accounting completed.
- IRIS’s Apollo Server instance was decommissioned; instead, we use one run by Nanometrics to simplify IRIS maintenance/administration.
- Q: What caused the increase in PH5 dataselect requests?
- Beaudoin suggested that this is related to an increase in nodal data availability.
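The “Other networks” lumping described in the usage statistics above can be sketched as follows. This is not the “Shipments” code; it assumes the 0.8% threshold applies to each network's individual share of requests, and the network codes and counts are invented placeholders.

```python
from collections import Counter


def lump_small_networks(requests_by_network, threshold=0.008,
                        label="Other networks"):
    """Merge every network whose share of requests is below `threshold`."""
    total = sum(requests_by_network.values())
    if total == 0:
        return {}
    lumped = Counter()
    for network, count in requests_by_network.items():
        # Keep the network's own code only if its share reaches the threshold.
        key = network if count / total >= threshold else label
        lumped[key] += count
    return dict(lumped)


# Invented placeholder counts, for illustration only.
toy_counts = {"NET_A": 500, "NET_B": 300, "NET_C": 195, "NET_D": 3, "NET_E": 2}
print(lump_small_networks(toy_counts))
```

In this toy input, NET_D and NET_E each fall below 0.8% of the 1000 requests and are reported together under “Other networks”.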
Section Status Reports: Quality Assurance – (Sharer)
- ISPAQ v.3 released on GitHub
- Can write results to SQLite3 database
- Can be a source of data for QuARG (Quality Assurance Report Generator)
- In Python 3, Jupyter notebook tutorials
- Nominal Response Library (NRL) modernization project
- New database fully populated, but responses need to be verified against the original NRL.
- Old NRL to be supported for a year to maintain support for PDCC until PDCC can be updated for new NRL.
- Ability to include “integrated” data-logger + sensor responses.
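As an illustration of why SQLite output is convenient for downstream tools such as QuARG, the sketch below queries metric results with Python's standard `sqlite3` module. The table layout, column names, and station identifiers are hypothetical and are not taken from ISPAQ's actual schema; an in-memory database stands in for a file written by ISPAQ.

```python
import sqlite3

# Hypothetical layout: one row per (target, metric, day). Not ISPAQ's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (target TEXT, metric TEXT, value REAL, start TEXT)"
)
conn.executemany(
    "INSERT INTO metrics VALUES (?, ?, ?, ?)",
    [
        ("XX.STA1..BHZ", "percent_availability", 100.0, "2022-03-01"),
        ("XX.STA2..BHZ", "percent_availability", 72.5, "2022-03-01"),
        ("XX.STA3..BHZ", "percent_availability", 99.1, "2022-03-01"),
    ],
)


def flagged_targets(connection, metric, min_value):
    """Return (target, value) pairs whose metric value is below min_value."""
    cursor = connection.execute(
        "SELECT target, value FROM metrics "
        "WHERE metric = ? AND value < ? ORDER BY target",
        (metric, min_value),
    )
    return cursor.fetchall()


print(flagged_targets(conn, "percent_availability", 90.0))
```

A report generator could run queries of this kind to list channels that need attention, without re-computing the metrics.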
Section Status Reports: QAAC report – (Sandvol/Sharer)
- The QAAC winter meeting was held on March 1st; the new Chair is Eric Sandvol
- FDSN QA efforts ongoing
- MUSTANG + Matlab scripts will become available at IRIS website soon
- Consider the use of synthetics in MUSTANG QA, as is done with ASL metrics software
- COVID-19 impacts on data availability: supply chain and international travel issues resulted in some diminished data availability.
- Next QAAC meeting will start to discuss merging of QA efforts in the cloud with respect to the upcoming merger.
- New QAAC Action Items include requesting a presentation from UNAVCO on their QA and archiving practices, assembling feedback from members about QAAC governance in EarthScope, and beginning discussion of QA for DAS, MT, etc. data (noting that there are currently no MT or DAS users as members, but one observer).
Section Status Reports: Cyberinfrastructure – (Casey)
- The efforts of the cyberinfrastructure section are devoted to co-developing for CCP in AWS with UNAVCO (see the CCP agenda item) as well as implementing an interim identity management system (see the Interim Identity Management agenda item)
- Nominal Response Library (*see the QA report)
- Ten consecutive quarters of 100% uptime. IRIS is / will be in contact with AWS to discuss Service Level Agreements (SLAs) to maintain a high level of uptime in CCP. DMC noted that things like Earthquake Early Warning and real-time needs will raise the need for high uptime in an EarthScope cloud infrastructure.
- Q: If some services need to be real-time, will everything need to be real-time? DMC response: There is a strong need to avoid system duplication in the CCP. Work is ongoing to identify which system components need real-time-level SLAs and how they could integrate with other non-real-time system components.
Section Status Reports: DMC Architecture and Products – (Trabant)
- miniSEED3 and the streaming SeedLink protocol are in the evaluation phase at FDSN
- The BackProjection product has been rebuilt in Python 3, published on GitHub, and made “cloud-ready”. Cloud-ready = ported to a single language, with cleaner inputs/outputs.
- Earth Model Collaboration (EMC) now supports alternate coordinate systems, such as UTM, not just lat/lon.
- Q: is there consideration to adopting standards and tooling for gridded data that are used more broadly than seismology? DMC response: yes, netCDF and CF conventions have this in mind, but we expect to consider this further with the merger with UNAVCO.
- Source Time Function data product also revisited, ported to Python 3, made more cloud friendly.
- Comment (Van der Lee): these are an opportunity to grow community expertise (students, interns) in porting research applications to something more cloud friendly.
- The end of EarthScope Automated Receiver Survey (EARS) measurements is anticipated due to code maintenance & security issues. The proposal is to deprecate EARS (turn off the calculations), but we are soliciting input on which parts of it could be maintained, and how.
- Committee response:
- EARS is still useful for education, large-scale surveys, initial crustal info for new studies
- Perhaps Docker-ize this application so that it can continue to be used as a benchmark for reproducibility
- Ability to add new stations could still be useful
- Consider keeping a repository of RFs from EARS for future applications.
- DMC will generate a preservation plan for EARS results, Docker-ize the app, then turn off the calculations.
- DMC is using a THREDDS Data Server to manage and serve EMC and is working on a public web service for users.
- The products section still tracks data product citations, but the results are likely under-representative because data citation itself is a complex topic.
- Q: is there someone at the DMC who could participate/champion data citation as part of a data citation workshop? DMC response: Manoch or Adam Clark – ESIP is also a great source.
DCC Reports: ASL DCC – (Ringler)
- COVID still delaying maintenance on some international stations.
- STS-1 replacements with STS-6 ongoing. Great improvement in horizontal noise.
- GSN timing metric is sent to DMC; a number of timing issues for non-Q330 stations have been found. Errors are of the same order as the data required for velocity models.
- Q (Dan A.): many stations have more than 10% clock error. This means that clock quality was < 60% more than 10% of the time? ASL response: yes. We have no reason to believe there are cases of reported good clock quality but actual bad timing. Using PKIKP from repeating events, but tricky.
- Large improvement of <1mHz spectra from large (> Mw8) events from STS-6.
- Tonga event globally observable in GSN seismic and pressure sensors. See a 3.7 and 4.4 mHz excitation of atmosphere coupled with ground. Afforded by high quality of GSN data.
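The interpretation of the timing metric confirmed by ASL in the Q&A above (clock quality below 60% for more than 10% of the time) can be sketched as below. This is not the ASL metric code; it assumes evenly spaced clock-quality samples in percent, and the sample values are invented for illustration.

```python
def percent_time_below(clock_quality, threshold=60.0):
    """Percent of samples whose clock quality is below `threshold`.

    Assumes evenly spaced samples, so a sample fraction equals a time fraction.
    """
    if not clock_quality:
        return 0.0
    below = sum(1 for quality in clock_quality if quality < threshold)
    return 100.0 * below / len(clock_quality)


# Invented clock-quality samples (percent), for illustration only.
samples = [100, 100, 80, 55, 100, 100, 40, 100, 100, 100]
clock_error = percent_time_below(samples)
print(clock_error, clock_error > 10.0)
```

Here two of the ten samples fall below 60%, so the station would count as having more than 10% clock error under this reading.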
DCC Reports: IDA DCC – (Mellors)
- 95% of IDA stations are using AWS cloud
- Maintenance issues are accumulating due to COVID restrictions
- DGAR lost GPS (so bad timing) but data still coming. +/- 5 seconds accuracy may be possible with effort.
- KWJN unknown issues.
- Calibrations are up to date, Q330 firmware updates almost complete
- Want to get AAK, RPN, ALE high-frequency data, and two SeedLink servers to AWS Cloud
- Investigating link between CTBTO and AWS cloud
- Cybersecurity is a continuing and increasing concern. Occasional changes in compliance stance can lead to threat of system down time.
- Proposed new site: JZAX, Uzbekistan
DCC Reports: PASSCAL Instrument Center Report – (Beaudoin)
- Fairly robust experiment footprint, but still impacted by COVID (can’t ship to “level 4” countries). Looking more normal, however, compared to the previous two years.
- Uptick in node usage driving much higher PH5 delivery (projected 6x over previous year).
- Magnetotelluric (MT) data: building out facilities and tools for these data. A short course is planned at the IRIS Workshop.
- Still awaiting updated Q330 nominal response @100, 200sps from Kinemetrics/Quanterra.
- Comment: several present expressed concern about Kinemetrics’ responsiveness to Q330 issues.
Project Reports: CCP Project update (Trabant)
- In December, the CCP timeline was updated (prototype by Dec 2022). The new timeline supports UNAVCO facilities/lease timeline.
- 4 Teams:
- GeoCrate. Data container for GNSS data. Also considering the time series use case (PH5 replacement, DAS in the future).
- Metadata. The system to organize, serve metadata. Very complex with multiple input data types/formats.
- Core platform data flow. The skeleton of the main pipeline for data movement.
- Infrastructure. Definition and management of the platform.
- Q: How can IRIS get early feedback on technical choices from a user perspective? A: IRIS will be reaching out for feedback from some users this year, but not at the expense of forward motion to meet deadlines.
- Q: How will the CCP ingest large volumes of data (e.g., DAS)? A: Hope to be able to stream data into the archive, as opposed to delivery of huge chunks of data.
- Q: What are the plans for real time data access? A: We plan to support the same export mechanisms that IRIS and UNAVCO currently have.
- Q: How was AWS decided as the cloud provider? A: The cost-benefit analysis, accelerated timeline, and UNAVCO’s expertise in AWS.
Project Reports: Interim Identity Management (Casey)
- Funders of many data repositories are asking what data are being requested, by whom, and for what purpose
- Currently, “who” is determined via IP address, “what” by bytes per network, and “purpose” by citations in the literature.
- single sign-on gateway via federated identity providers
- Login restricted to seismic data downloads
- Low-volume “public interest” access can still be anonymous (e.g. kiosk displays, education and workshops, some apps like Station Monitor and WILBER). Apps might need to register for this Public Interest exception.
- Use the FDSN-WS `queryauth` endpoint.
- User registration “sector of work” can help with the Purpose requirement.
- Some web services will change over in the next few months
- Comment: several members noted the problems associated with automated data request codes and the need to establish credentials periodically.
- Comment: members expressed the need for strong communication/messaging of this change, over multiple communication pathways (email, Twitter, etc.), and direct to developers of common/popular web services clients.
- Q: Will MUSTANG fall under ID management? A: not initially, but perhaps in the CCP, if it is simpler to manage.
- Comment: if apps need to register for an exemption, maybe other uses could as well. E.g., reproducible research related to a publication.
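For developers of automated request codes, the switch to the `queryauth` endpoint noted above might look like the sketch below. It assumes the current FDSN web service convention, in which `queryauth` takes the same parameters as `query` and uses HTTP digest authentication; the interim identity management system may use a different mechanism. The function names, the station parameters in the usage note, and the credentials are placeholders, and no request is sent.

```python
import urllib.request
from urllib.parse import urlencode

BASE = "https://service.iris.edu/fdsnws/dataselect/1"


def build_dataselect_url(net, sta, loc, cha, start, end, authenticated=True):
    """Build a dataselect URL; only the endpoint name differs when authenticated."""
    endpoint = "queryauth" if authenticated else "query"
    params = {"net": net, "sta": sta, "loc": loc, "cha": cha,
              "start": start, "end": end}
    return f"{BASE}/{endpoint}?{urlencode(params)}"


def make_digest_opener(user, password):
    """Return an opener that answers HTTP digest challenges for BASE."""
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, BASE, user, password)
    handler = urllib.request.HTTPDigestAuthHandler(manager)
    return urllib.request.build_opener(handler)


url = build_dataselect_url("IU", "ANMO", "00", "BHZ",
                           "2022-03-08T00:00:00", "2022-03-08T00:10:00")
print(url)
# To fetch data (placeholder credentials, not executed here):
# data = make_digest_opener("user@example.org", "password").open(url).read()
```

Because only the endpoint name and the credentials change, existing scripts built around `query` would need a small, well-defined update, which supports the members' request for direct communication with client developers.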
Project Reports: Interim Statistics (Carter)
- Already have the database designed/implemented to collect IDM statistics, ready for identities, but can already provide some statistics using IP addresses.
- % of data in SAGE from US-funded networks through time: about 50% permanent, ~25% temporary (for the last decade or so).
- International data sources are about as important to US-funded requesters as US data sources are.
- By age, real-time applications drive usage up in statistics, but older data popularity is nearly uniform.
- Comment: it will be important to be clear & transparent about what the ID usage reporting will look like.
Project Reports: Citation and Data Licensing (Van der Lee)
- Workshop proposal being developed with UNAVCO Data Services Advisory Committee, including international and journal participation.
- Licensing and citation are inter-related issues
- Invite relevant stakeholders: Shelly Stall (ESIP), Michelle G.
- Important to reach out as broadly as possible for participation.
Board Discussion Items (Van der Lee)
Review and/or revise recommendations/considerations on the governance structures and/or practices needed for EarthScope and communicate/coordinate with the UNAVCO DSAC.
- Recommendation from DSSC to circulate both DSSC and DSAC charges to respective members prior to a joint meeting (see below), in preparation for a discussion on future governance structure following merger.
- Board: there will be input for governance from an upcoming joint IRIS & UNAVCO Management & Governance workshop (April 28-29, VIPs + committee chairs), but bottom-up input from committees is still critical.
- Outcome of the workshop will be a top-level management structure recommendation.
- DSSC proposes developing a draft merged DSSC + DSAC charter before the April workshop. It need not be wordsmithed but could be high-level (principles).
- Brainstorming DSSC principles:
- Advise on any aspect of the Data Services program
- Represent the Community of Data Services Users to the DS program and vice versa.
- Diversity of representation on the committee (by scientific discipline, gender, ethnicity, etc.)
- Provide guidance on new initiatives, where DS should consider growth, opportunities, etc.
- Promote transparent budget and advise on budget priorities.
- ACTION ITEM (Jonathan A.): wordsmith this list of principles, share with Julie E., David M., Jerry.
Consider plans for the supplemental two years of the SAGE-II (FY24 and FY25) award period. In particular, do you envision specific opportunities for innovation in your programs, whether in the area of operations and maintenance (e.g., investments or operating strategies that could reduce out-year operating costs) or the specific capabilities and services provided by the program? Please note that the NSF is very keen to see innovations in the SAGE and GAGE facilities and may even be willing to invest additional funds in the facilities to realize such innovations.
- Brainstorming topics for innovation:
- CCP is a massive innovation that allows a suite of secondary innovations:
- Flexibility for multiple data/metadata formats from seismology, geodesy, and new types of data such as DAS
- Ability to handle increasing data volumes (nodes, high-frequency GNSS and DAS)
- Proximity to cloud compute resources
- Find ways to provide a bridge/support/training between the massive data archive and the researchers/other users/computation. Education & training of researchers will be key:
- SCEC compute committee considering similar issues. Possible opportunities for collaboration.
- Develop within ROSES (via IRIS EPO); this can also be a testing ground for cloud-based access tools.
- Involve graduate students in providing the education & training, develop documentation. Employ graduate student interns at DMC to build valuable skills that are broader than just geophysical research and to build affinity with and understanding of DS.
- SCEDC already has seismic data in AWS. There is an opportunity to use these data to help train students and community members prior to CCP.
- Note: UNAVCO outreach directorate is where some of these outreach/education efforts live in UNAVCO. Potential to collaborate/align with them.
- Large N ingestion & preparing for this (streaming, DAS, ubiquitous sensors like MyShake).
- Storage and distribution of legacy data (analog data from microfilms and microfiches).
Discuss & plan a joint meeting of the IRIS DSSC & UNAVCO DSAC (per the recent message from Aster and Grapenthin).
- An additional (virtual) meeting along with normal meeting cycles.
- Regular cycle of joint meetings. Begin sometime in May.
Update DS lists of priorities for the use of carryover funds or any end-of-year funds that we might receive from NSF. This information will assist in planning the allocation of any Y4 variances and will help us look ahead to Y5 (and beyond).
- The top priority is to apply excess funds to the successful completion of the CCP.
- It is also proposed to allocate some resources for ROSES/SCOPED cloud-based educational efforts and involve graduate students to organize learning materials.
- Move certain big or representative data users to the cloud and evaluate the impact, including costs, as an on-ramp with a pilot group of users.