In 2017, Dialogic carried out a project for Rijkswaterstaat to analyse the concept lists (lists of defined terms) of municipalities. The aim was to explore which data could be used and which steps were needed to analyse it. The government and the VNG have asked for this research to be refreshed and expanded. With the Omgevingswet, the government aims to simplify and integrate the regulations for spatial development. This research focuses on harmonising Omgevingswet concepts between municipalities using text-mining and machine-learning methods.
Text-mining helps to structure unstructured datasets. Using similarity measures such as the Jaccard similarity (figure 1), we can calculate how much texts differ from each other, even across large amounts of text. The Jaccard similarity is used not only in text-mining but also in applications such as plagiarism detection and recommender systems.
Take the term "event" for example. Two different municipalities use slightly different definitions of this term:
Municipality 1: all entertainment performances open to the public, including: ...
Municipality 2: every entertainment performance open to the public except: ...
When calculating the Jaccard similarity, we treat each definition as a set of words and take the ratio between the size of the intersection (the number of words that appear in both set 1 and set 2) and the size of the union (the number of unique words across both sets): J(A, B) = |A ∩ B| / |A ∪ B|. The table below illustrates how the intersection between the two sets is calculated.
[Table 1 image]
The Jaccard similarity is 1 when the two sets contain exactly the same elements and 0 when no elements match. In this case, the two sets combined contain 12 unique words, 6 of which appear in both sets. The Jaccard similarity between these two sets is therefore 6 / 12 = 0.5.
[Table 2 image]
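The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the research: the function names and the tokenisation rule (lowercase, letters only) are assumptions, and only the visible part of each definition is used because the full texts are truncated above. The resulting counts therefore differ from the 6 shared and 12 unique words in the tables.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of unique words (assumed rule)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Only the visible fragments; the source elides the rest of each definition.
municipality_1 = "all entertainment performances open to the public, including"
municipality_2 = "every entertainment performance open to the public except"

words_1 = tokenize(municipality_1)
words_2 = tokenize(municipality_2)

shared = words_1 & words_2   # words that appear in both definitions
unique = words_1 | words_2   # all distinct words across both definitions
similarity = jaccard_similarity(words_1, words_2)

print(sorted(shared))
print(len(shared), len(unique), round(similarity, 3))
```

Note that "performances" and "performance" count as different words here. A real pipeline would probably normalise word forms (for example by stemming) before comparing, which is a design choice this sketch leaves out.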
After calculating the Jaccard similarity, a cut-off value needs to be determined. This cut-off is chosen so that two concepts scoring above it can be treated as equivalent with a high level of certainty. Once text-mining methods have identified which concepts closely align, machine-learning algorithms can classify new concepts. After the unstructured concept list has been labelled, predictive models such as a Support Vector Machine or a Deep Neural Network can determine how closely a new concept matches one of the standardised concepts. This not only structures existing concepts but also provides a framework for formulating new ones.
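To make the role of the cut-off concrete, the sketch below matches a new definition against a small set of standardised concepts. It deliberately uses a simple best-match rule on the Jaccard similarity rather than a Support Vector Machine or neural network, so it illustrates the idea only. The standardised concepts, their word sets, the cut-off of 0.6 and the name `match_concept` are all hypothetical.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical standardised concepts, each represented as a set of words.
STANDARD_CONCEPTS = {
    "event": {"entertainment", "performance", "open", "to", "the", "public"},
    "building": {"structure", "with", "a", "roof", "and", "walls"},
}

CUT_OFF = 0.6  # hypothetical value; in practice it would be tuned on labelled data


def match_concept(words, standards, cut_off):
    """Return (label, score) of the best match, or (None, score) below the cut-off."""
    best_label, best_score = None, 0.0
    for label, standard_words in standards.items():
        score = jaccard_similarity(words, standard_words)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= cut_off:
        return best_label, best_score
    return None, best_score


# A definition that overlaps strongly with the hypothetical "event" concept.
close = {"every", "entertainment", "performance", "open", "to", "the", "public"}
# A definition that shares some words but stays below the cut-off.
distant = {"a", "road", "open", "to", "the", "public"}

close_match = match_concept(close, STANDARD_CONCEPTS, CUT_OFF)
distant_match = match_concept(distant, STANDARD_CONCEPTS, CUT_OFF)
print(close_match)
print(distant_match)
```

A higher cut-off gives more certainty that matched concepts really are equivalent, at the cost of leaving more definitions unmatched. A trained classifier would replace the best-match rule once enough labelled concepts are available.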
The methods developed have many applications beyond the Omgevingswet research. For example, the Jaccard similarity can be used to compare any sets of elements and, although very useful for text-mining, is not limited to it. Machine-learning algorithms can be applied to a wide range of classification problems. You can test some of these algorithms yourself on our [Data Science and Dashboards](https://www.dialogic.nl/dienstverlening/data-science-dashboards/) page. One example is a neural network that predicts the job function of a vacancy in education.


