Hackathon on data science for STI policy

A STIP Data Lab and OECD-TIP event | 23 May – 7 June 2022

By Andrés Barreneche and Jan Einhoff
OECD Directorate for Science, Technology and Innovation
19 July 2022

The OECD recently held its first hackathon on data science for Science, Technology and Innovation (STI) policy. The event, co-organised by STIP Data Lab and the OECD TIP Working Party, created a new space for government officials, policy analysts and researchers to jointly experiment with new approaches to STI policy analysis using recent data science techniques. It also offered (post)graduate students and researchers a unique opportunity to engage with, and help resolve, real-world challenges presented by STI policy experts. The event featured natural language processing, machine learning, cluster analysis and other methods applied to policy analysis, which revealed patterns in how governments leverage STI to achieve policy objectives and how they tackle specific societal challenges.

The hackathon demonstrated how such techniques can be used to condense corpora of text documents describing strategic policy goals and initiatives into a small number of salient themes, and to compare and contrast them. It was a useful opportunity to experiment with the new avenues data science is opening up for evidence-based policy research and advice. This document explains how the hackathon was organised and summarises the projects conducted by postgraduate students and researchers from around the world.

How did the hackathon work?

OECD delegates and policy analysts led teams from six research institutions located in the United States, Europe and Asia: Aalborg University, Fraunhofer Institute ISI, Georgia Institute of Technology, Korea Advanced Institute of Science and Technology, the Science Policy Research Unit (SPRU) at the University of Sussex and Tokyo University. Team leaders worked with the OECD Committee for Scientific and Technological Policy (CSTP) or with one of its subsidiary bodies: the Working Party on Innovation and Technology Policy (TIP), the Working Party of National Experts on Science and Technology Indicators (NESTI), and the Global Science Forum (GSF).

Team leaders formulated a policy question for their teams to address (Table 1) using two main data sources: the STIP Compass policy database and the TIP STI database on national strategies, possibly in conjunction with the STI.Scoreboard database. Policy questions were designed to be directly relevant to ongoing projects conducted by the CSTP and its working parties. In addition, the organising team helped to shape the questions to make them feasible to tackle using the datasets made available for the hackathon.

Table 1. Hackathon participants and their policy questions

Team	Team leader (TL)	TL organisation	Policy question
Aalborg University	Tiago Santos Pereira (TIP)	Centre for Social Studies, University of Coimbra	To what extent can we identify distinct instruments and country policy goals that reflect a novel co-creation approach vs. a more traditional knowledge-transfer approach?
Fraunhofer Institute	Joseba Sanmartín (NESTI)	Spanish Foundation for Science and Technology (FECYT)	In what ways do STI strategies differ between countries around the theme of sustainability (climate change) and, in particular, around the energy innovation system?
Georgia Institute of Technology	Caroline Paunov (TIP)	OECD	To what extent are countries’ green transition goals, as set out in their strategies, reflected in their STI policies?
Korea Advanced Institute of Science and Technology	Alan Paic (CSTP)	OECD	Can we characterise typologies of policies in support of making research data from publicly funded research openly accessible and reusable to the largest extent possible?
Science Policy Research Unit (SPRU), University of Sussex	Daniel Ferreira (GSF)	Portuguese National Funding Agency for Science, Research and Technology (FCT)	To what extent is it possible to characterise typologies of policy proposals on the theme of scientific employment and research careers?
Tokyo University	Philippe Larrue (CSTP)	OECD	What information is provided in strategies about policy implementation, such as specific goals, timelines, budgetary commitments or policy actions and/or their governance/monitoring?

The organising team set up a dedicated GitHub repository to provide more details about the three datasets, together with guidance and examples for accessing and using them in analyses with Python and R. Teams could also ask questions and upload their materials through this repository. Over the course of two weeks, teams were asked to devote the equivalent of at least two full days of work to their question. Because teams had limited availability during the hackathon, it was made clear that no definitive answers were expected. Rather, the goal was for them to propose one or more innovative approaches and, where possible, present some initial observations from the data.

What were the outcomes of the hackathon?

During the 7 June closing event, teams gave ten-minute presentations summarising their ideas and findings. It quickly became clear that teams had devoted considerable effort, well beyond our expectations, to developing a variety of methodologies for tackling real policy questions.

Aalborg University

The Aalborg University team was tasked with identifying instruments and policy goals that reflect a novel co-creation approach to research and innovation, as opposed to the traditional knowledge-transfer approach. To achieve this, they worked with their team leader to devise a classification strategy based on representative keywords for co-creation (e.g. co-funding, multilateral, platforms) and for knowledge transfer (e.g. bilateral, innovation vouchers, spin-off). The results suggest a proliferation of the co-creation approach in STIP Compass policy initiatives starting in 2015. The strategy also allowed the team to identify salient features of co-creation policies. For instance, they tend to address a wider variety of beneficiaries and policy themes at once. The team also proposed a word2vec word embedding approach to calculate the pairwise similarity between policy initiatives. Using these similarities, they built a network of policies to identify thematic clusters that make recurrent use of co-creation keywords, such as climate change, COVID-19 and artificial intelligence.
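A keyword-based classification of this kind can be sketched in a few lines of Python. This is only an illustration, not the team's code: the keyword lists contain just the examples quoted above, the tie-breaking rule is an assumption, and the initiative descriptions are made up.

```python
import re
from collections import Counter

# Illustrative keyword lists: only the examples quoted in the text above.
# The team's full lists are not reproduced here.
KEYWORDS = {
    "co-creation": ["co-funding", "multilateral", "platforms"],
    "knowledge transfer": ["bilateral", "innovation vouchers", "spin-off"],
}


def count_keywords(text: str) -> Counter:
    """Count case-insensitive, whole-term keyword hits per approach."""
    lowered = text.lower()
    hits = Counter()
    for approach, terms in KEYWORDS.items():
        for term in terms:
            # Reject matches that are embedded in a longer word or hyphenated term.
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            hits[approach] += len(re.findall(pattern, lowered))
    return hits


def classify(text: str) -> str:
    """Return the approach with more hits; ties (including no hits) stay unclassified."""
    hits = count_keywords(text)
    co, kt = hits["co-creation"], hits["knowledge transfer"]
    if co == kt:
        return "unclassified"
    return "co-creation" if co > kt else "knowledge transfer"


# Hypothetical initiative descriptions, not taken from STIP Compass.
examples = [
    "A multilateral programme offering co-funding through open platforms.",
    "Innovation vouchers and spin-off support under a bilateral agreement.",
    "Tax relief for research and development expenditure.",
]
labels = [classify(text) for text in examples]
```

Counting labels by the start year of each initiative would then give a rough view of how the two approaches evolve over time.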

Access the Aalborg University team’s presentation, R/Python code and other materials.

Fraunhofer Institute ISI

The Fraunhofer Institute ISI team aimed to identify trends in national strategies that are relevant for the energy transition. To this end, they used the definition proposed for such policies in the STIP Compass database to extract pivot sentences. This allowed them to build a model that classifies sentences in the TIP strategies corpus as relevant or not for the energy transition. The relevant sentences were then semantically analysed using BERT topic modelling, allowing the team to isolate the main strategic themes in the text, such as hydrogen technologies, transportation and waste management. In a second step, they applied the language model to the STIP Compass dataset to explore the policy coherence between both datasets. Among other patterns, they found that countries raise the energy transition more frequently in policies supporting the public research system than in those supporting business innovation.

Access the Fraunhofer Institute ISI team’s presentation.

Georgia Institute of Technology

The Georgia Institute of Technology team compared goals for the green transition, as set out in national strategies, with the actual policy initiatives countries have in place. To tackle this challenge, they started by splitting the more than 300 strategies in the TIP corpus into more than 380,000 sentences. The team read through 800 of these sentences and assessed whether each was related to green transitions or not. They used these labelled sentences as the “ground truth” to train a BERT language model that estimated the probability that each of the remaining sentences in the TIP corpus was relevant. The team used this model to describe the strategies corpus and the STIP Compass policy initiative data, proposing metrics that estimate how intensively countries address transition goals and how broad the thematic scope of such goals is.
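The source does not spell out how the team's metrics were defined. As a purely hypothetical illustration, the sketch below shows one simple way sentence-level probabilities could be aggregated into a country-level intensity score: the share of sentences whose estimated probability reaches a threshold. The function name, the 0.5 threshold and the example probabilities are all assumptions.

```python
def intensity(probabilities: list[float], threshold: float = 0.5) -> float:
    """Share of sentences whose estimated relevance reaches the threshold.

    Hypothetical definition for illustration; not the team's actual metric.
    """
    if not probabilities:
        return 0.0
    relevant = sum(1 for p in probabilities if p >= threshold)
    return relevant / len(probabilities)


# Made-up model outputs: one probability per sentence in each country's strategies.
scores = {
    "Country A": [0.9, 0.8, 0.2, 0.1],
    "Country B": [0.3, 0.6, 0.1, 0.2, 0.05],
}
intensity_by_country = {country: intensity(p) for country, p in scores.items()}
```

Normalising by the number of sentences keeps countries with long and short strategy documents comparable, although the choice of threshold would need to be validated against the labelled sentences.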

Access the Georgia Tech team’s presentation.

Korea Advanced Institute of Science and Technology

The Korea Advanced Institute of Science and Technology (KAIST) team sought to identify typologies of policies in support of open science and enhanced access to publications and research data. They used the taxonomies proposed by the STIP Compass data model to characterise the various instruments that countries have introduced in this field. For instance, the team observed that most national strategies in support of open science have a limited degree of coordination in their implementation. However, these strategies propose various monitoring mechanisms, such as periodic reporting or evaluation, dedicated budget allocations, a dedicated public monitoring body, or a combination of these. The team also explored topic modelling to identify emerging open science themes in the TIP strategies corpus and the STIP Compass data, such as international cooperation, public research and business innovation.

Access the KAIST team’s presentation and Python code.

University of Sussex’s Science Policy Research Unit

The team from the University of Sussex’s Science Policy Research Unit (SPRU) identified typologies of policies with a focus on scientific employment and research careers. To this end, they used the relevant slice of the STIP Compass dataset for semantic analysis. This analysis consisted of calculating a TF-IDF (term frequency-inverse document frequency) matrix from the text found in the initiatives’ description and objective fields. They then applied principal component analysis (PCA) to identify two components that explained 20% of the variance of the TF-IDF matrix. Using k-means clustering, countries were grouped by the similarity of their policy mixes across the two components. This allowed the team to identify clusters in policy mixes around employment creation, research capabilities, gender equality and private sector research.
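To make the first step concrete, the sketch below builds a small TF-IDF matrix using only the Python standard library. It is a minimal illustration under stated assumptions, not the team's R code: it uses the plain formulation (relative term frequency multiplied by log(N / document frequency)), whereas library implementations often apply smoothing and normalisation, and the three documents are made up. The PCA and k-means steps are omitted.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs: list[str]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one TF-IDF row per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        rows.append([counts[term] / total * idf[term] for term in vocab])
    return vocab, rows


# Made-up initiative descriptions, not taken from STIP Compass.
docs = [
    "Research fellowships for early career staff",
    "Gender equality in research careers",
    "Tax incentives for private sector research",
]
vocab, matrix = tfidf_matrix(docs)
```

A term that appears in every document, such as "research" here, receives a weight of zero, which is why TF-IDF highlights the vocabulary that distinguishes one initiative from another.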

Access the SPRU team’s presentation and R code.

Tokyo University

The Tokyo University team was tasked with analysing how national strategies plan for implementation, whether by setting out specific targets, timelines, budget allocations, follow-up mechanisms or a combination of these. In a first step, the team sought to identify relevant sections (blocks of text) in the TIP corpus of strategies using a list of keywords indicative of implementation. These sections were then analysed using topic modelling, with a view to identifying relevant themes such as public finance, assessment/evaluation, and education and training. Using this model, the team calculated pairwise similarities between countries in their approach to strategy implementation. The team also extended their approach to identify strategies that specifically propose a mission-oriented approach. They found that such missions recurrently address the topics of carbon emissions, energy and digitalisation, among others.
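The source does not say which similarity measure was used. As one plausible illustration, the sketch below compares countries by the cosine similarity of their topic shares. The country names, the three topics and all values are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical topic shares per country, in the order:
# public finance, assessment/evaluation, education and training.
topic_shares = {
    "Country A": [0.6, 0.3, 0.1],
    "Country B": [0.5, 0.4, 0.1],
    "Country C": [0.1, 0.2, 0.7],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One similarity score per unordered pair of countries.
similarities = {
    (a, b): cosine(topic_shares[a], topic_shares[b])
    for a, b in combinations(sorted(topic_shares), 2)
}
most_similar = max(similarities, key=similarities.get)
```

With these made-up values, Country A and Country B form the most similar pair because both put most of their weight on public finance and evaluation.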

Access the Tokyo University team’s presentation and Python code.

How was the hackathon organised?

Contacting research institutions with a known background in applied data science for STI policy analysis. The OECD shared a hackathon concept note to explore their interest in participating. Research institutions formed teams of 4-8 members including MSc and PhD students and postgraduates.
Requesting research institutions’ available dates for kicking off, working on and closing the hackathon.
Contacting delegates and senior analysts available and willing to take up the role of team leader and formulate a policy question (with the organising team’s guidance).
Setting up a dedicated GitHub repository providing the necessary guidance and examples to access the datasets to be used during the hackathon. The “Issues” tab allowed teams to ask questions about the datasets, which automatically notified the relevant member of the organising team by email.
The kick-off event introduced the hackathon’s datasets and the specific policy questions each team would tackle. Teams had the opportunity to discuss the question with their team leader.
Teams worked on the hackathon over the course of two weeks, organising one or two sprint meetings with their team leader to discuss progress and receive feedback.
Teams were invited to prepare 10-minute presentations summarising their work for the closing event.
Following the closing event, the OECD organised a separate technical workshop where teams were able to elaborate on and discuss their methodologies. This allowed teams to debrief and compare the models, model parameters, software packages, libraries and other technical choices they made during the hackathon.

Learn more

Data science for STI policy analysis

STIP Data Lab

Policy data

STIP Compass database
