TERCET NUTS-postal codes matching tables Methodological notes


EUROPEAN COMMISSION
EUROSTAT
Directorate E: Sectoral and regional statistics
Unit E-4: Regional statistics and geographical information

TERCET NUTS-postal codes matching tables
Methodological notes

Date:
Version: 1.00
Authors:
Revised by:
Approved by:
Public:
Reference Number: Eurostat 2017-GISCO-NUTS2013-PC-MET-NOTES-V1

Commission européenne, 2920 Luxembourg, LUXEMBOURG - Tel

Document History

Version   Date   Comment                     Modified Pages
                 Document created by GISCO   All

Contact:

1. INTRODUCTION

The purpose of this document is to provide detailed information on the data sources and the methodology used to create the TERCET NUTS-postal codes matching tables that Eurostat disseminates via the TERCET tool. It also gives an overview of known issues and limitations of these data, which stem from the quality of the data sources and the tools used.

The TERCET NUTS-postal codes matching tables contain a lookup list of European postal codes and their corresponding NUTS codes.

2. METHODOLOGY

The process to develop the TERCET NUTS-postal codes matching tables involved the following steps:
(1) Acquisition of national postal code lists from public and commercial data;
(2) Geocoding of postal codes (centroids);
(3) Spatial join of postal code centroids to NUTS areas;
(4) Extraction of matching tables.

2.1. Acquisition of postal codes

GISCO has obtained lists of postal codes from a variety of sources, in order of priority:
(1) Official Member States' postal authorities and national postal service providers;
(2) If these are not available, the postal codes from Geonames are used;
(3) If Geonames information is not available, other free public data and commercial data sources (e.g. Wikipedia or TomTom) are used.

The reference year of the postal codes, where documented by the providers, is 2016. The following sections give an overview of the source data for postal codes per country and state the available level where the most detailed level could not be used.

Official data from national postal services providers

Country   Source     Level
UK        Geonames   Lowest level, with the exception of Northern Ireland, which is at postal district level

Geonames contains postal code centroids for a total of 22 EU28, EFTA and Candidate Countries. We assumed that the location of the postal code area centroids had been quality checked. Hence they were used directly for the subsequent spatial join with NUTS areas (see 2.3).

Table 1: Completeness of Geonames 2016 data.
Country   Source listed by Geonames   Level   Records in Geonames   Records from MS or 2010 data   %

AT   Lowest level
BE   Lowest level
BG   Lowest level
CH   Amtliche Vermessung Schweiz / swisstopo   Lowest level *
DE   sourceforge.net/projects/opengeodb   Lowest level
DK   Lowest level *
ES   inmadrid.enredados.com/   Lowest level N/A
FI   postinumeroluettelo   Lowest level *
HU   Lowest level
IT   easyreserve   Lowest level *
LT   Lowest level * 4.19
LU   Lowest level *
NL   nl.wikipedia.org   Mid level, last two characters missing (e.g. 1000 instead of 1000 AP) % *
NO   Lowest level *
PT   Lowest level N/A
SK   Lowest level #
SE   program/prg00745.zip   Lowest level N/A
MT   Lowest level *
TR   r/   Lowest level *
IE   Higher level; Eircode is a new postal code system, not every area has a postcode   N/A
HR FR   Lower level excludes CED   N/A

'Records in Geonames' means the number of postal codes that were available from Geonames and could be used for TERCET.

'Records from MS or 2010 data' means the number of postal codes that are freely available from the Member State's postal authority or another official source, or that were provided as part of the previous postal codes-NUTS matching exercise for TERCET, which took place in 2010. These figures served as the benchmark for the completeness of the matching.

(*) Data from 2010 TERCET flat files, based on a previous collection (reference year 2010) of postal codes from national postal service providers. These files were outdated, and the access and use conditions for several countries did not allow redistribution to the public.

(+) Germany has 8230 records in TomTom. The 2010 figure includes post boxes and business/government postal codes, which cannot be geographically located. This is also likely to explain the large mismatches between 2010 and 2016 data in other countries.

2.2. Geocoding of open data

If the location/coordinates of postal codes were not shipped with the codes, either directly from the provider or from Geonames, Eurostat/GISCO geocoded them using a mix of free and open geocoders and commercial geocoding data such as TomTom.

Postal codes (without coordinates) were obtained from public and free sources. The lowest and most accurate level was used where available and possible. A script developed by Eurostat checks their location using one or more geocoding servers. In addition, Eurostat's internal geocoding server, which is based on OpenStreetMap, was used.

The result is an internal dataset that contains the centre points of postal code areas for a total of 9 EU28, EFTA, Candidate and potential Candidate Countries. The primary purpose of this internal dataset is to locate postal code areas in NUTS areas and thus match postal codes to NUTS codes for use in Eurostat's TERCET tool. It is not for dissemination outside the European Commission due to access and use restrictions on the input data.

The resulting geographical coordinates were extracted. Once this process was finished, the results from the geocoding servers were compared using software such as ArcGIS and the Google Maps service. Based on this comparison, additional corrections were made to the affected postal codes, such as moving points located in water onto land.
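The comparison between geocoding servers is not specified in detail here. As a minimal sketch only, assuming each geocoder returns one (latitude, longitude) pair per postal code, disagreements could be flagged for manual correction as follows. The function names, the sample coordinates and the 5 km threshold are hypothetical and are not taken from the Eurostat script.

```python
import math
from itertools import combinations


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def max_spread_m(points):
    """Largest pairwise distance among the candidate points of one postal code."""
    return max((haversine_m(*a, *b) for a, b in combinations(points, 2)), default=0.0)


def flag_for_review(results, threshold_m=5000.0):
    """Return the postal codes whose geocoder results differ by more than threshold_m."""
    return sorted(code for code, pts in results.items() if max_spread_m(pts) > threshold_m)


# Hypothetical input: postal code -> list of (lat, lon), one entry per geocoder.
results = {
    "1000": [(50.8466, 4.3528), (50.8470, 4.3520)],  # geocoders agree closely
    "2000": [(51.2194, 4.4025), (51.9000, 4.4025)],  # geocoders far apart
}
print(flag_for_review(results))
```

Points flagged this way would still need a manual check, for example to move a centroid that falls in water onto land.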
Table 2: Level of available postal codes and hits from geocoding per country.

Country   Level   2016   MS or 2010   %

EE   Lowest level
EL   Lowest level
IS   Lowest level
PL   Lowest level   N/A
SI   Lowest level   * 89
RO   Lowest level   * 98.7
MK   Lowest level   * 99.5
MN   Lowest level   NA NA
CY   Lowest level   * 99.7

(*) Data from 2010 TERCET flat files, based on a previous collection (reference year 2010) of postal codes from national postal service providers. These files were outdated, and the access and use conditions for several countries did not allow redistribution to the public.

2.3. Spatial join and table extraction

The postal code centroids, together with their locations as obtained from the three different sources and processes above, were integrated into one single database. This postal codes database was spatially joined to the NUTS 2010 and NUTS 2013 areas provided as "EuroBoundaryMap by EuroGeographics" (footnote 4), scale 1:100 000. Postal codes located outside our NUTS boundaries (e.g. in open waters) were matched to the closest NUTS area belonging to the same country.

The resulting code matches, without the centroids, have been extracted and are available as matching tables from the TERCET tool. TERCET has versions for NUTS 2010 and NUTS 2013, using postal codes as obtained in 2016 from the various sources.

3. QUALITY ASSESSMENT

Final and intermediate results were checked against the following criteria:
(1) Spatial accuracy of the geocoding;
(2) Formatting of postal code strings;
(3) Completeness of the individual postal codes files (e.g. comparison of the number of records against official postal code area information).

3.1. Spatial Accuracy

Approximately 50% of all the postcodes come from the UK. As these data come from the Ordnance Survey/UK Royal Mail, Eurostat considers them official and accurate.

For the remaining countries, spatial checks were carried out against the TomTom geocoding feature class. TomTom postcode points were extracted from the centre points of the lines making up the geocoding polyline feature class. The centre point for each postcode area was then calculated from all the line centre points with the same postal code.
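The text does not say how the centre point was calculated from the line centre points. The sketch below assumes a plain arithmetic mean of latitudes and longitudes per postal code, which is one simple possibility and not necessarily the method used. The function name and the input layout are hypothetical.

```python
from collections import defaultdict


def postcode_centres(line_centres):
    """Average the line centre points that share a postal code.

    line_centres: iterable of (postal_code, lat, lon) tuples.
    Returns a dict postal_code -> (mean_lat, mean_lon).
    Assumption: an unweighted mean is acceptable at postal code scale.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running lat sum, lon sum, count
    for code, lat, lon in line_centres:
        acc = sums[code]
        acc[0] += lat
        acc[1] += lon
        acc[2] += 1
    return {code: (s[0] / s[2], s[1] / s[2]) for code, s in sums.items()}


# Hypothetical line centre points for two postal codes.
centres = postcode_centres([
    ("1010", 48.0, 16.0),
    ("1010", 48.2, 16.4),
    ("1020", 48.3, 16.5),
])
print(centres)
```

An unweighted mean treats every street segment equally. Weighting by segment length would be an alternative, but nothing in the text indicates which variant was applied.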
For a sample of 5% of the postal code dataset, the distance was measured between the postal code centre point and its equivalent point in TomTom. For each country, we then calculated the average distance. (Note: TomTom may not have full coverage, or may only have higher-level postal codes, so a comparison is not always possible.) The following table shows the average distance between geocoded centroids and TomTom centroids per country.

Table 3: Distance comparison Eurostat geocoding - TomTom data.

Country                   Average distance (m)
AT                        712
BE                        2381
CH                        888
DE                        1374
DK                        2295
EE                        953
ES                        3396
FI
FR                        1312
HR                        2429
HU                        2138
IS
IT                        1506
LI                        242
LT                        6924
LV                        2656
NL                        250
NO                        2939
Median of all countries   989

(Footnote 4) Generalised versions of the NUTS boundaries can be obtained from the Eurostat website.

The largest disparity is in Iceland, where the majority of points closely match the TomTom point but a small number are incorrectly located, which increases the average error. In this case, however, the assigned NUTS code is still correct because Iceland has only one statistical area.

Figure 1: Extreme difference between the location of a postal code in TomTom and from geocoding.

Lithuania also has a large error, so alternative data sources may be required. Differences of 1 km or more may in some cases lead to the wrong NUTS code being assigned. However, for the 5% sample both sources assigned each postal code to the same NUTS 3 code. We therefore assume that wrong geocoding is a minority case. As postal code areas are not nested inside NUTS areas, it cannot be fully avoided.
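The distance comparison behind Table 3 can be illustrated with a short sketch. It assumes coordinates in WGS84, great-circle (haversine) distances, and a random sample drawn over the whole dataset rather than per country. None of these details are stated in the text, and the function names and sample data are hypothetical.

```python
import math
import random
import statistics
from collections import defaultdict


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def average_distance_by_country(geocoded, reference, sample_share=0.05, seed=0):
    """Average distance per country for a random sample of shared postal codes.

    geocoded, reference: dict (country, postal_code) -> (lat, lon).
    Only postal codes present in both sources can be compared.
    """
    common = sorted(set(geocoded) & set(reference))
    if not common:
        return {}
    size = min(len(common), max(1, round(len(common) * sample_share)))
    sample = random.Random(seed).sample(common, size)
    per_country = defaultdict(list)
    for key in sample:
        per_country[key[0]].append(haversine_m(*geocoded[key], *reference[key]))
    return {country: statistics.mean(d) for country, d in per_country.items()}


# Hypothetical centroids; the full set is used here (sample_share=1.0) to keep the demo stable.
geocoded = {("AT", "1010"): (48.0, 16.0), ("BE", "1000"): (50.0, 4.0)}
reference = {("AT", "1010"): (48.0, 16.0), ("BE", "1000"): (50.01, 4.0)}
averages = average_distance_by_country(geocoded, reference, sample_share=1.0)
print(averages, statistics.median(averages.values()))
```

With a sample drawn over the whole dataset, small countries may receive few or no sampled points, which is one reason a per-country average can be missing or unstable.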

3.2. Formatting

All postal codes have been checked for correct formatting (allowed characters, string length, range) against available definitions. The flat files of the matching tables were checked as well, to avoid unwanted characters, spacing, etc.

3.3. Completeness

After the geocoding, the completeness of the postal codes was rechecked against official public information, or against commercial data (TomTom) if no official data was accessible. This completeness check included removing post box postal codes with no meaningful location.

4. KNOWN ISSUES

4.1. Location

The locations of the postcodes are not 100% accurate in all cases, due to the quality of the source data and the limitations of geocoding. Tests on a sample have identified a few points located far from their true location or, for instance, in open water. However, these are a minority and in many cases will still fall within the correct NUTS region.

4.2. Code format

Numeric postal codes with a leading '0', e.g. "0001", are converted by the GIS program to a numeric value, i.e. from "0001" to "1". Depending on the format and range of the national postal codes, this happened in a number of countries and had to be corrected after the matching: CY, EL, FR, IT, LT, LU, LV, MK, NO, TR.

4.3. Coverage

Table 2 and Table 3 show differences in the number of postal codes between 2010 and 2016. As these postal codes were derived from open and TomTom data (with the exception of the UK) instead of directly from the custodians of postal code information, it is possible that some postcodes are missing. Very few Member States distribute a complete list of postal codes, which makes validation difficult in most cases. Postal codes are sometimes retired and new postcodes are frequently created, so their completeness cannot be ensured. The larger differences in Germany, Lithuania, Croatia and Malta can, in part, be explained by the inclusion of large institutions, such as governments, that have their own postal codes and post boxes.
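The leading-zero correction described in section 4.2 can be sketched as a zero-padding step followed by a format check. The format table below is a small illustrative subset (four-digit codes for CY and NO, five-digit codes for FR). The names FORMATS and restore_leading_zeros are hypothetical and do not come from the Eurostat tooling.

```python
import re

# Illustrative subset: country -> (code length, pattern for allowed characters and length).
FORMATS = {
    "CY": (4, re.compile(r"\d{4}")),
    "FR": (5, re.compile(r"\d{5}")),
    "NO": (4, re.compile(r"\d{4}")),
}


def restore_leading_zeros(country, value):
    """Pad a numeric postal code back to its national length and validate it."""
    width, pattern = FORMATS[country]
    code = str(value).strip().zfill(width)  # 1 -> "0001" for a four-digit format
    if not pattern.fullmatch(code):
        raise ValueError(f"unexpected postal code {value!r} for {country}")
    return code


print(restore_leading_zeros("NO", 1))
print(restore_leading_zeros("FR", "1000"))
```

Storing postal codes as text from the start avoids the problem altogether, since the padding step only repairs values that have already been converted to numbers.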