This R tutorial uses a STIP Compass dataset for the quantitative analysis of texts on national science, technology and innovation (STI) policy initiatives. It demonstrates how you can quickly get started with natural language processing (NLP) methods using the quanteda package. The tutorial has three sections. The first section shows how to load the R packages that provide the functions we need to work with the data, and how to download the dataset to be analysed. The second section shows how to prepare the dataset so that we can analyse the descriptive textual information on policy initiatives. The third section shows how to pre-process the text data before conducting some basic analyses, and suggests further resources for NLP analysis.

1. Preparation: load R packages and download data

First, we install (if necessary) and load the packages used throughout the tutorial.

#install each package if it is not yet available, then load it
if (!require(quanteda)) {install.packages('quanteda'); library(quanteda)}
if (!require(quanteda.textplots)) {install.packages('quanteda.textplots'); library(quanteda.textplots)}
if (!require(quanteda.textstats)) {install.packages('quanteda.textstats'); library(quanteda.textstats)}
if (!require(tidyverse)) {install.packages('tidyverse'); library(tidyverse)}

#the tokenizers package is used later via tokenizers:: and only needs to be installed
if (!requireNamespace('tokenizers', quietly = TRUE)) {install.packages('tokenizers')}

Next, we download the most recent STIP Compass dataset (this may take a while) and read it into R. The file is saved to your current working directory; to see where that is, just type 'getwd()' in the R console.

url <- 'https://stip.oecd.org/assets/downloads/STIP_Survey.csv'

#download the dataset
download.file(url, destfile = 'stip.csv', mode = 'wb')

#load the dataset into our working environment
stip <- read_delim('stip.csv', '|', escape_double = FALSE, trim_ws = TRUE)
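
If you want to confirm where the downloaded file ended up, you can print the working directory, as mentioned above:

#show the folder where 'stip.csv' was downloaded to and read from
getwd()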

2. Prepare the dataset

Next, we trim the dataset. In its raw form, a policy initiative can appear several times in the CSV file, once for each policy instrument associated with it. This makes the dataset hard to handle if we just want to look at policy initiatives. Moreover, most of its 800+ columns contain specific and detailed information on the reported instruments. As we do not need this information in this tutorial, we drop it as follows.

#Most columns with info on instruments start with the letter 'F' followed by a number. This removes all columns matching this pattern
stip <- stip[,!grepl('^[F][0-9]', (names(stip)))]

#There are a few more columns with information on instruments. We remove them too, for consistency
stip <- stip[,!grepl('Instrument', (names(stip)))]

#This code identifies unique initiative IDs. When multiple rows share the same initiative ID, only one of them is retained. Since we have already removed all information on instruments, no information is lost by keeping each initiative only once
stip <- stip %>%
  distinct(InitiativeID, .keep_all = T)
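
To get a sense of how much the trimming has reduced the dataset, you can check its dimensions after these steps (a quick check using base R; the exact numbers depend on the survey wave you downloaded):

#number of remaining rows (initiatives) and columns
dim(stip)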

The first row of the dataset does not contain actual data, but descriptions of the variables. We extract this row and turn it into a dataframe that serves as our codebook. We then remove the first row from the dataset, as we no longer need it, and look at the codebook to get a first impression of the data.

codebook <- as.data.frame(t(stip[1,])) %>%
  rownames_to_column()

names(codebook) <- c('Variable', 'Code')

#...remove the first row from the dataset: 
stip <- stip[-1, ]

#take a look at the codebook
head(codebook) #The first few variables names are mostly self-explanatory 
##               Variable                      Code
## 1         InitiativeID     Policy Initiative URI
## 2           SurveyYear        Year of the survey
## 3          NameEnglish              English name
## 4         CountryLabel              Country name
## 5          CountryCode              Country code
## 6 NameOriginalLanguage Name in original language
tail(codebook) #For other variables, the codebook is instructive (note that 'TG' stands for 'Target Group')
##     Variable                           Code
## 117     TG35    Technology transfer offices
## 118     TG36          Industry associations
## 119     TG37 Academic societies / academies
## 120     TG38   Secondary education students
## 121     TG40           International entity
## 122      TG9        Established researchers
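
Later in the tutorial we will consult this codebook to find the variable corresponding to a particular theme (for instance 'TH31'). A minimal lookup, assuming the codebook created above:

#look up what a given variable code stands for
codebook %>%
  filter(Variable == 'TH31')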

Some data cleaning: The 'InitiativeID' column contains a link that ends with an individual identifier for each initiative. We remove the link and retain only the identifying number.

stip <- stip %>%
  mutate(InitiativeID = as.numeric(gsub('http://stip.oecd.org/2019/data/policyInitiatives/', '', InitiativeID)))

3. Quantitative text analysis

This tutorial focuses on analysing the textual data describing policy initiatives in the dataset. The STIP data has several columns with textual information: a 'ShortDescription' column, several 'Objectives' columns and a 'Background' column. We combine the columns with descriptions and objectives into a new, merged column. We do not include the background information, since providing it is not mandatory in the survey. Moreover, respondents use the background field in various ways, and it often elaborates on the context in which an initiative was introduced rather than describing the initiative itself. The new merged column gathers in one place, for each initiative, all the textual information to be analysed in this tutorial. We will refer to the entries in this column as the "documents" that we analyse.

#this creates a vector with the names of all columns we wish to unite
cols <- c('ShortDescription', names(stip)[grepl('Objectives', names(stip))])

#this unites these columns in the new column 'all_texts'
stip$all_texts <- apply(stip[ ,cols], 1, paste, collapse = ' ')

#take a look at the first few new documents (i.e. the pieces of textual data that we will analyse)
head(stip$all_texts, 3)
## [1] "INTER enables the FNR to initiate bi or multilateral arrangements for project calls in conjunction with other national or international funding bodies. To give Luxembourg’s public research a higher profile in the international context by providing funding for international collaboration. NA NA NA NA NA"                                                                                                                                                                                                                                                                                                                                                                                                  
## [2] "A multi-annual thematic research programme to strengthen the scientific quality of Luxembourg’s public research in the country’s priority research domains. Funding of high quality scientific research, leading to the generation of new knowledge and scientific publications in the leading international peer-reviewed outlets of the respective fields. Development of a strong research basis in Luxembourg which can be exploited for sustainable long-term socio-economic and environmental benefits. Advancement of the research group or institution in view of international visibility and critical mass. Training of doctoral students and advancement of the involved researchers in general. NA NA"
## [3] "The programme provides financial support to cover article processing charges that may arise through the publication of peer-reviewed research results in Open Access To promote the free access to research results from FNR-(co)funded projects. NA NA NA NA NA"
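
As an aside: the trailing 'NA NA ...' strings appear because empty Objectives columns are pasted in as the literal text 'NA'. These short tokens are removed during pre-processing below, but if you prefer cleaner documents you could build the merged column with tidyr::unite() instead. A sketch that stores the result in a separate column so that the rest of the tutorial is unaffected:

#alternative: combine the columns while dropping missing values (result not used below)
stip_alt <- tidyr::unite(stip, 'all_texts_clean', all_of(cols), sep = ' ', remove = FALSE, na.rm = TRUE)
head(stip_alt$all_texts_clean, 3)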

3.1. Prepare and pre-process textual data

To analyse the text data on policy initiatives we first build a corpus from the newly created documents. In the corpus, each initiative has an associated document, identified by the initiative's ID. The information from all other columns in the STIP dataset becomes metadata to the documents.

stip_corp <- corpus(stip, docid_field = 'InitiativeID', text_field = 'all_texts')

#take a look
stip_corp
## Corpus consisting of 5,663 documents and 121 docvars.
## 1325 :
## "INTER enables the FNR to initiate bi or multilateral arrange..."
## 
## 1327 :
## "A multi-annual thematic research programme to strengthen the..."
## 
## 1328 :
## "The programme provides financial support to cover article pr..."
## 
## 1330 :
## "The IBBL Institute is an autonomous not-for-profit institute..."
## 
## 1331 :
## "The 3rd industrial revolution strategy is a national initiat..."
## 
## 1332 :
## "The IPBG awards a block allocation of PhD and/or Postdoc gra..."
## 
## [ reached max_ndoc ... 5,657 more documents ]
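
To confirm that the other columns have indeed been attached as document-level metadata, you can inspect the corpus' docvars (a quick check):

#names of the first few document variables attached to the corpus
head(names(docvars(stip_corp)))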

Next, we create a document-feature matrix (dfm) from the corpus. Many techniques of quantitative text analysis use a dfm as their input. In the dfm, each row is a document and each column is a word. Cells indicate the number of times a word appears in a document. The dfm does not retain the order of words in a document; rather, it treats documents as bags of words. After creating the dfm, we remove numbers and English stopwords, which are short function words (such as 'to', 'and', 'or'). We also remove all words with fewer than 3 characters.

stip_dfm <- dfm(stip_corp)

stip_dfm <- stip_dfm %>%
  dfm_remove(stopwords('english'), min_nchar = 3) %>%
  dfm_remove(pattern = '(?<=\\d{1,9})\\w+', valuetype = 'regex' )

#Take a look: this dfm still has more than 14,000 features
stip_dfm  
## Document-feature matrix of: 5,663 documents, 14,471 features (99.8% sparse) and 121 docvars.
##       features
## docs   inter enables fnr initiate multilateral arrangements project calls
##   1325     1       1   1        1            1            1       1     1
##   1327     0       0   0        0            0            0       0     0
##   1328     0       0   0        0            0            0       0     0
##   1330     0       0   0        0            0            0       0     0
##   1331     0       0   0        0            0            0       0     0
##   1332     0       0   0        0            0            0       0     0
##       features
## docs   conjunction national
##   1325           1        1
##   1327           0        0
##   1328           0        0
##   1330           0        0
##   1331           0        1
##   1332           0        0
## [ reached max_ndoc ... 5,657 more documents, reached max_nfeat ... 14,461 more features ]

Documents tend to contain a lot of information that is irrelevant for the analysis, such as stylistic and rare expressions. A common goal of pre-processing is therefore to reduce the number of features in the dfm. This makes it easier to conduct analyses and to arrive at clear-cut results. We therefore pre-process the dfm further, inter alia by reducing all words to their word stem.

stip_dfm  <- stip_dfm %>% 
  dfm_wordstem() %>% #stem the dfm
  dfm_trim(min_docfreq = 0.01,  docfreq_type = 'prop') %>% # retain only words included in at least 1% of documents
  dfm_subset(ntoken(stip_dfm) >= 10) # remove documents with fewer than 10 words

#Take a look again: we have now substantially reduced the number of features, to fewer than 1,000
stip_dfm
## Document-feature matrix of: 5,532 documents, 638 features (95.9% sparse) and 121 docvars.
##       features
## docs   enabl initi project call nation intern fund bodi give public
##   1325     1     1       1    1      1      3    2    1    1      1
##   1327     0     0       0    0      0      2    1    0    0      2
##   1328     0     0       1    0      0      0    1    0    0      1
##   1330     0     0       0    0      0      0    0    0    0      0
##   1331     0     1       0    0      1      0    0    0    0      0
##   1332     0     0       0    0      0      0    0    0    0      4
## [ reached max_ndoc ... 5,526 more documents, reached max_nfeat ... 628 more features ]
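
Besides printing the dfm, you can tabulate feature frequencies with textstat_frequency() from quanteda.textstats, which also reports in how many documents each feature occurs. A quick look at the first rows:

#overall and document frequencies of the stemmed features
head(textstat_frequency(stip_dfm), 10)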

The dataset also contains a column with innovation-related keywords for each initiative (drawn from a dedicated vocabulary of concepts). This is highly useful for the analysis, so we generate a second dfm from it. We build this dfm differently from the previous one, since the unit of analysis here is not individual words but keywords, which often consist of multi-word expressions.

tag_dfm <- tokenizers::tokenize_regex(stip$Tags, pattern = '¬') %>%
  as.tokens() %>%
  dfm() %>%
  dfm_remove(min_nchar = 3)

rownames(tag_dfm) <- stip$InitiativeID
docvars(tag_dfm) <- stip

tag_dfm
## Document-feature matrix of: 5,663 documents, 873 features (99.6% sparse) and 123 docvars.
##       features
## docs   business intelligence funding agencies international collaboration
##   1325                     1                1                           1
##   1327                     0                0                           0
##   1328                     0                0                           0
##   1330                     0                0                           0
##   1331                     0                0                           0
##   1332                     0                0                           0
##       features
## docs   critical mass societal challenge phd students research groups
##   1325             0                  0            0               0
##   1327             1                  1            1               1
##   1328             0                  0            0               0
##   1330             0                  0            0               0
##   1331             0                  0            0               0
##   1332             0                  0            0               0
##       features
## docs   research priorities research programmes resource management
##   1325                   0                   0                   0
##   1327                   1                   1                   1
##   1328                   0                   0                   0
##   1330                   0                   0                   0
##   1331                   0                   0                   0
##   1332                   0                   1                   0
## [ reached max_ndoc ... 5,657 more documents, reached max_nfeat ... 863 more features ]

3.2. Analyze textual data

We can now analyze the two dfms in many ways, depending on our interest. A first step might be to look at the most common features in the dfm.

textplot_wordcloud(stip_dfm)
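
If you prefer a plain list to a wordcloud, quanteda's topfeatures() returns the most frequent features directly:

#the 20 most frequent (stemmed) features in the dfm
topfeatures(stip_dfm, 20)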

We can also generate a wordcloud using the dataset's keywords:

textplot_wordcloud(tag_dfm, max_words = 100, min_count = 3, max_size = 2, min_size = .5)

An interesting question we can ask is how language differs across subsets of policy initiatives. Many different subsets are conceivable. As an example, we consider how word use in initiatives linked to the theme 'Financial support to business R&D and innovation' stands out in comparison to all other initiatives. Consulting the codebook that we generated earlier, we can see that this theme corresponds to the variable 'TH31'. We use the first dfm (based on the merged column of text data), which captures more fine-grained details on policy initiatives than the second dfm (based on the dataset's keywords).

fs_keyness <- textstat_keyness(stip_dfm, 
                              target = stip_dfm$TH31 == 1)
textplot_keyness(fs_keyness)
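
The plot is based on a keyness statistic computed for every feature. To inspect the underlying numbers, you can look at the head of the textstat_keyness() result:

#features most strongly associated with the 'Financial support...' theme
head(fs_keyness, 10)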

We can investigate the theme 'Financial support to business R&D and innovation' further by considering only the documents in the dataset linked to this theme. Below, we create a subset dfm containing only documents on this theme and then compare the initiatives from Canada to all others. We see that Canadian policy initiatives have a much stronger focus on female innovators than the average.

fs_dfm <- dfm_subset(stip_dfm, stip_dfm$TH31 == 1) 
  
can_keyness <- textstat_keyness(dfm_remove(fs_dfm, pattern = c('canada', 'canadian')), 
                              target = fs_dfm$CountryCode == 'CAN')

textplot_keyness(can_keyness)

One of many other options is to compare the documents from different countries. The results of such a comparison should be treated with caution, since countries report information in different ways. For example, the comparison below does not consider whether some countries report more information on particular themes or survey questions than others. Moreover, this analysis assigns equal weight to all initiatives, although their budgets differ substantially. Still, the analysis reveals similarities between countries that one might expect. For instance, Germany and Austria come out close to each other, as do the USA, Canada and Australia.

#create a dfm that merges all documents by country
dfm_countries <- dfm_group(stip_dfm, groups = CountryCode)

#computes distances between documents from different countries
tstat_dist <- as.dist(textstat_dist(dfm_countries))

#cluster countries based on these distances
user_clust <- hclust(tstat_dist)

plot(user_clust, cex = 0.5)
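
One simple way to address the caveat that countries report different amounts of text is to weight the grouped dfm by relative term frequencies before computing distances. A sketch using quanteda's dfm_weight() (the resulting dendrogram may differ somewhat from the one above):

#weight each country's merged document by relative term frequencies
dfm_countries_prop <- dfm_weight(dfm_countries, scheme = 'prop')

#recompute distances between countries and cluster again
tstat_dist_prop <- as.dist(textstat_dist(dfm_countries_prop))
plot(hclust(tstat_dist_prop), cex = 0.5)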

Final notes: