From: Pierre Choffet
Date: Thu, 28 Oct 2021 21:44:00 +0000 (-0400)
Subject: From WMO servers to data validation
X-Git-Url: https://git.choffet.net/?a=commitdiff_plain;h=fbe2925f3b6ff0ba4ffe718087666ea85fc0a822;p=wmo_to_wikidata.git

From WMO servers to data validation

First batch of the process, which does the following:
- download stations metadata from WMO, and keep the local cache up to date
- convert the metadata into an XML equivalent
- filter unwanted content and fix known mistakes
- validate the resulting file

The global structure is in place but the final operations aren't there yet:
more filtering will have to be done in the future.

An incomplete README file has also been added.
---

diff --git a/README b/README
new file mode 100644
index 0000000..9538894
--- /dev/null
+++ b/README
@@ -0,0 +1,28 @@
+wmo_to_wikidata - Import World Meteorological Organization weather stations
+metadata into Wikidata.
+
+This repository contains a set of scripts that download, clean, verify and
+compare WMO stations metadata before importing it into Wikidata as needed.
+
+The original source code in this repository is sponsored by Wikimedia Canada.
+
+The following tools are required as dependencies - they should be available
+prepackaged for most GNU/Linux distros.
+
+ - Bash - https://www.gnu.org/software/bash/
+   Shell script interpreter.
+
+ - Curl - https://curl.se/
+   Download WMO data.
+
+ - Xmlstarlet - http://xmlstar.sourceforge.net/
+   XSD and XSLT processor.
+
+ - Yq - https://kislyuk.github.io/yq/
+   Jq wrapper to convert WMO JSON into XML.
+
+
+The repository contains the following tools:
+ - update.sh
+   Ensure WMO stations cache is up to date, convert original JSON into XML,
+   clean and validate data.
diff --git a/README.md b/README.md
deleted file mode 100644
index e687faf..0000000
--- a/README.md
+++ /dev/null
@@ -1 +0,0 @@
-# wikimedia-pilot
\ No newline at end of file
diff --git a/schemas/stations.xsd b/schemas/stations.xsd
new file mode 100644
index 0000000..6a4013b
--- /dev/null
+++ b/schemas/stations.xsd
@@ -0,0 +1,440 @@
[440 added lines of XML Schema markup, stripped from this copy]
diff --git a/schemas/xml.xsd b/schemas/xml.xsd
new file mode 100644
index 0000000..aea7d0d
--- /dev/null
+++ b/schemas/xml.xsd
@@ -0,0 +1,287 @@
+
+
+
+
+
+
+
+

About the XML namespace

+ +
+

+ This schema document describes the XML namespace, in a form + suitable for import by other schema documents. +

+

+ See + http://www.w3.org/XML/1998/namespace.html and + + http://www.w3.org/TR/REC-xml for information + about this namespace. +

+

+ Note that local names in this namespace are intended to be + defined only by the World Wide Web Consortium or its subgroups. + The names currently defined in this namespace are listed below. + They should not be used with conflicting semantics by any Working + Group, specification, or document instance. +

+

+ See further below in this document for more information about how to refer to this schema document from your own + XSD schema documents and about the + namespace-versioning policy governing this schema document. +

+
+
+
+
+ + + + +
+ +

lang (as an attribute name)

+

+ denotes an attribute whose value + is a language code for the natural language of the content of + any element; its value is inherited. This name is reserved + by virtue of its definition in the XML specification.

+ +
+
+

Notes

+

+ Attempting to install the relevant ISO 2- and 3-letter + codes as the enumerated possible values is probably never + going to be a realistic possibility. +

+

+ See BCP 47 at + http://www.rfc-editor.org/rfc/bcp/bcp47.txt + and the IANA language subtag registry at + + http://www.iana.org/assignments/language-subtag-registry + for further information. +

+

+ The union allows for the 'un-declaration' of xml:lang with + the empty string. +

+
+
+
+ + + + + + + + + +
+ + + + +
+ +

space (as an attribute name)

+

+ denotes an attribute whose + value is a keyword indicating what whitespace processing + discipline is intended for the content of the element; its + value is inherited. This name is reserved by virtue of its + definition in the XML specification.

+ +
+
+
+ + + + + + +
+ + + +
+ +

base (as an attribute name)

+

+ denotes an attribute whose value + provides a URI to be used as the base for interpreting any + relative URIs in the scope of the element on which it + appears; its value is inherited. This name is reserved + by virtue of its definition in the XML Base specification.

+ +

+ See http://www.w3.org/TR/xmlbase/ + for information about this attribute. +

+
+
+
+
+ + + + +
+ +

id (as an attribute name)

+

+ denotes an attribute whose value + should be interpreted as if declared to be of type ID. + This name is reserved by virtue of its definition in the + xml:id specification.

+ +

+ See http://www.w3.org/TR/xml-id/ + for information about this attribute. +

+
+
+
+
+ + + + + + + + + + +
+ +

Father (in any context at all)

+ +
+

+ denotes Jon Bosak, the chair of + the original XML Working Group. This name is reserved by + the following decision of the W3C XML Plenary and + XML Coordination groups: +

+
+

+ In appreciation for his vision, leadership and + dedication the W3C XML Plenary on this 10th day of + February, 2000, reserves for Jon Bosak in perpetuity + the XML name "xml:Father". +

+
+
+
+
+
+ + + +
+

About this schema document

+ +
+

+ This schema defines attributes and an attribute group suitable + for use by schemas wishing to allow xml:base, + xml:lang, xml:space or + xml:id attributes on elements they define. +

+

+ To enable this, such a schema must import this schema for + the XML namespace, e.g. as follows: +

+
+          <schema . . .>
+           . . .
+           <import namespace="http://www.w3.org/XML/1998/namespace"
+                      schemaLocation="http://www.w3.org/2001/xml.xsd"/>
+     
+

+ or +

+
+           <import namespace="http://www.w3.org/XML/1998/namespace"
+                      schemaLocation="http://www.w3.org/2009/01/xml.xsd"/>
+     
+

+ Subsequently, qualified reference to any of the attributes or the + group defined below will have the desired effect, e.g. +

+
+          <type . . .>
+           . . .
+           <attributeGroup ref="xml:specialAttrs"/>
+     
+

+ will define a type which will schema-validate an instance element + with any of those attributes. +

+
+
+
+
+ + + +
+

Versioning policy for this schema document

+
+

+ In keeping with the XML Schema WG's standard versioning + policy, this schema document will persist at + + http://www.w3.org/2009/01/xml.xsd. +

+

+ At the date of issue it can also be found at + + http://www.w3.org/2001/xml.xsd. +

+

+ The schema document at that URI may however change in the future, + in order to remain compatible with the latest version of XML + Schema itself, or with the XML namespace itself. In other words, + if the XML Schema or XML namespaces change, the version of this + document at + http://www.w3.org/2001/xml.xsd + + will change accordingly; the version at + + http://www.w3.org/2009/01/xml.xsd + + will not change. +

+

+ Previous dated (and unchanging) versions of this schema + document are at: +

+ +
+
+
+
+ +
+
diff --git a/update.sh b/update.sh
new file mode 100755
index 0000000..dd6df55
--- /dev/null
+++ b/update.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+# update.sh - Scripts to merge WMO data with Wikidata.
+# Copyright (C) 2021 Pierre Choffet
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of version 3 of the GNU General Public License as published
+# by the Free Software Foundation.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see .
+
+set -euxo pipefail
+
+# Script cache dir
+CACHE_DIR=${CACHE_DIR:-"${HOME}/.cache/wmo_to_wikidata/"}
+
+# Any stations cache older than this (in minutes) will be updated
+STATIONS_MAX_AGE=${STATIONS_MAX_AGE:=1440}
+
+# Hardcoded values
+OSCAR_STATIONS_URL='https://oscar.wmo.int/surface/rest/api/search/station'
+STATIONS_CACHE_PATH="${CACHE_DIR}/stations.xml"
+STATIONS_CLEANED_CACHE_PATH="${CACHE_DIR}/stations_cleaned.xml"
+
+# Fail if something is missing
+function assertEnvironment() {
+    for name in curl yq xmlstarlet
+    do
+        if ! type "${name}" > /dev/null 2>&1
+        then
+            echo "Cannot find ${name}. Exiting"
+            exit 1
+        fi
+    done
+}
+
+# Update stations cache, if needed
+function ensureStationsCache() {
+    local -r outdated_path=$(find "${STATIONS_CACHE_PATH}" -mmin "+${STATIONS_MAX_AGE}")
+
+    if [ ! -f "${STATIONS_CACHE_PATH}" ]||[ "${outdated_path}" != '' ]
+    then
+        local -r stations_download_path="$(mktemp)"
+
+        mkdir -p "${CACHE_DIR}"
+        curl "${OSCAR_STATIONS_URL}" > "${stations_download_path}"
+        echo "$(yq -x --xml-root station .stationSearchResults "${stations_download_path}")" | xmlstarlet fo -t > "${STATIONS_CACHE_PATH}"
+        rm "${stations_download_path}"
+    fi
+}
+
+assertEnvironment
+ensureStationsCache
+
+# Clean stations cache for known problems
+xmlstarlet tr -s xslts/stations_clean.xslt "${STATIONS_CACHE_PATH}" | xmlstarlet fo -t > "${STATIONS_CLEANED_CACHE_PATH}"
+
+# Validate stations cache
+xmlstarlet val -e -s schemas/stations.xsd "${STATIONS_CLEANED_CACHE_PATH}"
diff --git a/xslts/stations_clean.xslt b/xslts/stations_clean.xslt
new file mode 100644
index 0000000..5971d2e
--- /dev/null
+++ b/xslts/stations_clean.xslt
@@ -0,0 +1,61 @@
[61 added lines of XSLT markup, stripped from this copy]
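
Reviewer note: the cache-freshness test in ensureStationsCache can be read in isolation as the sketch below. It is only an illustration, not code from this commit: the function name cache_is_stale and the demo paths are hypothetical, and it assumes a find that supports -mmin (GNU and BSD find do; POSIX does not require it). Checking for the file before calling find also avoids the error message find prints when the cache does not exist yet.

```shell
#!/bin/bash

# Hypothetical helper (not part of update.sh): returns 0 when the cache file
# is missing or was last modified more than max_age_minutes ago.
cache_is_stale() {
    local -r path="$1"
    local -r max_age_minutes="$2"

    # A missing cache always needs to be downloaded.
    [ -f "${path}" ] || return 0

    # find prints the path only when the file is older than the threshold,
    # so a non-empty result means the cache is outdated.
    [ -n "$(find "${path}" -mmin "+${max_age_minutes}")" ]
}

# Usage: a file that was just created is not older than one day.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/stations.xml"
if cache_is_stale "${demo_dir}/stations.xml" 1440
then
    echo "refresh needed"
else
    echo "cache is fresh"   # this branch runs for the freshly touched file
fi
rm -r "${demo_dir}"
```

Keeping the test in one function would let the download step be guarded by a single condition, with the same 1440-minute default as STATIONS_MAX_AGE.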