Constitution of the corpus

References acquisition, management & access

September 2024

Nicolas Casajus

Senior data scientist
@FRB-CESAB

Reference acquisition

Reference management

Reference access

Remove duplicates

Extras

Exercise

Reference acquisition

Two methods

Web interface

File formats

Plain text (.txt)

FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Buisson, L
   Thuiller, W
   Casajus, N
   Lek, S
   Grenouillet, G
AF Buisson, Laetitia
   Thuiller, Wilfried
   Casajus, Nicolas
   Lek, Sovan
   Grenouillet, Gael
TI Uncertainty in ensemble forecasting of species distribution
SO GLOBAL CHANGE BIOLOGY
PY 2010
VL 16
IS 4
BP 1145
EP 1157
DI 10.1111/j.1365-2486.2009.02000.x
UT WOS:000274813800001
ER
EF

RIS format (.ris)

TY  - JOUR
AU  - Buisson, L
AU  - Thuiller, W
AU  - Casajus, N
AU  - Lek, S
AU  - Grenouillet, G
TI  - Uncertainty in ensemble forecasting of species distribution
T2  - GLOBAL CHANGE BIOLOGY
PY  - 2010
VL  - 16
IS  - 4
SP  - 1145
EP  - 1157
DO  - 10.1111/j.1365-2486.2009.02000.x
AN  - WOS:000274813800001
ER  -
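
As a rough illustration of how a tag-based format like RIS can be read, the sketch below parses one record in base R. The `ris` vector is a shortened copy of the record above, and the object names (`ris`, `tags`, `values`, `record`) are illustrative assumptions. Dedicated packages handle many more edge cases (multi-line values, several records per file).

```r
# A shortened copy of the RIS record shown above ----
ris <- c("TY  - JOUR",
         "AU  - Buisson, L",
         "AU  - Thuiller, W",
         "TI  - Uncertainty in ensemble forecasting of species distribution",
         "PY  - 2010",
         "DO  - 10.1111/j.1365-2486.2009.02000.x",
         "ER  -")

# Keep only 'TAG  - value' lines (drops the end-of-record marker) ----
ris <- ris[grepl("^[A-Z][A-Z0-9]  - .", ris, perl = TRUE)]

# Split each line into its tag and its value ----
tags   <- substr(ris, 1, 2)
values <- sub("^[A-Z][A-Z0-9]  - ", "", ris, perl = TRUE)

# Group values by tag (repeated tags such as AU become a vector) ----
record <- split(values, tags)

record$"AU"
record$"PY"
```

Grouping with `split()` keeps repeated tags together, which is why the two authors end up in a single character vector.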

BibTeX format (.bib)

@article{buisson-2010-gcb,
  author  = {Buisson, Laetitia and Thuiller, Wilfried and Casajus, Nicolas and 
             Lek, Sovan and Grenouillet, Ga{\"{e}}l},
  title   = {Uncertainty in ensemble forecasting of species distribution},
  journal = {Global Change Biology},
  year    = {2010},
  volume  = {16},
  number  = {4},
  pages   = {1145-1157},
  doi     = {10.1111/j.1365-2486.2009.02000.x}
}

Structure

% Entry type
@article{ ... }

% Citekey
buisson-2010-gcb

% Key-value pairs
journal = {Global Change Biology}

% or
journal = "Global Change Biology"
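
To make the three components concrete, here is a minimal base R sketch that pulls the entry type, the citekey, and the key-value pairs out of a one-line entry. The `entry` string and the regular expressions are illustrative assumptions, not a full BibTeX parser: nested braces (as in `Ga{\"{e}}l`) are not handled.

```r
# A one-line BibTeX entry, for illustration only ----
entry <- paste0('@article{buisson-2010-gcb, ',
                'journal = {Global Change Biology}, year = "2010"}')

# Entry type: the word between '@' and the opening brace ----
entry_type <- sub("^@(\\w+)\\{.*$", "\\1", entry, perl = TRUE)

# Citekey: the text between the opening brace and the first comma ----
citekey <- sub("^@\\w+\\{([^,]+),.*$", "\\1", entry, perl = TRUE)

# Key-value pairs: values are delimited by braces or double quotes ----
pattern <- "(\\w+)\\s*=\\s*(\\{[^{}]*\\}|\"[^\"]*\")"
pairs   <- regmatches(entry, gregexpr(pattern, entry, perl = TRUE))[[1]]

# Separate keys from values and strip the delimiters ----
keys   <- sub("\\s*=.*$", "", pairs, perl = TRUE)
values <- sub("^[^=]*=\\s*", "", pairs, perl = TRUE)
values <- gsub("^[{\"]|[}\"]$", "", values, perl = TRUE)

entry_type
citekey
setNames(values, keys)
```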

Advantages:

Easy to read and to understand
Good support for accents and case
Recognized by R, R Markdown, Quarto, LaTeX, Zotero/Mendeley, etc.

Suggested reading: Quick BibTeX Guide

Application Programming Interface (API)

Description/Advantages

Command line interface
Good for automation & reproducibility
Structured raw data (JSON, XML, etc.)
Various clients available (in R, Python, etc.)
Available for Web of Science, Scopus, OpenAlex
No API for Google Scholar (web scraping)

Limitations

Requires a token (authentication)
Sometimes not free (WoS)
Limited number of records per request
Limited number of requests per month
Incomplete data (e.g. missing abstracts)
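
Because the number of records per request is capped, a large result set has to be fetched page by page. The sketch below only works out which records each request would cover. The values of `total` and `per_page` are made up for illustration, and no real API is called.

```r
# Hypothetical values: 128 records to fetch, at most 50 per request ----
total    <- 128
per_page <- 50

# Number of requests needed ----
n_pages <- ceiling(total / per_page)

# First and last record index covered by each request ----
pages <- data.frame(
  "page"  = seq_len(n_pages),
  "first" = (seq_len(n_pages) - 1) * per_page + 1,
  "last"  = pmin(seq_len(n_pages) * per_page, total))

pages
```

Knowing `n_pages` in advance also helps to check that a search stays within a monthly request quota before it is launched.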

WoS Starter API

You need to request an API token

Metadata available:

Document identifier
Document type
Title
Authors
Published year
Published month
Source
Volume
Issue
Pages
Article number
Book editors
Keywords
DOI
ISSN
ISBN
PMID
Times cited

WoS Starter API client

GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/rwosstarter")

Usage:

# Search equation ----
query <- 'AU="Casajus N"'

# Get the total number of records ----
rwosstarter::wos_search(query, database = "WOS")
## [1] 28

# Download records metadata ----
refs <- rwosstarter::wos_get_records(query, database = "WOS")

Output (no abstract):

Reference management

Reference management software

Allows you to:

store, organize, and annotate references
retrieve metadata & full text (connection to databases)
easily add new references (plugins for Web browsers)
insert in-text citations (plugins for Word, Writer, RStudio)
generate a bibliography
share libraries and collaborate (online account)

Software comparison

	EndNote	Mendeley	Zotero
OS
License	Proprietary (Clarivate)	Proprietary (Elsevier)	Open source (AGPL)
Pricing	> $200	Free	Free
Online storage	2 GB (+ pricing options)	2 GB (+ pricing options)	300 MB (+ pricing options)
Citation styles	> 7,000	> 7,000	> 10,000
Google Docs
LaTeX
Customization

Zotero

Open source and free
Well documented
Active community
Web browser connector
Word processor plugin
A lot of plugins
A lot of styles
Support for LaTeX, BibTeX, and RStudio

Zotero plugins

Zotero connector
- Available for the most popular Web browsers
- Add items to your Zotero library in one click
- Add PDF (if open access)

Word processor plugin
- Available for Word, Writer and Google Docs
- Insert in-text citations
- Generate bibliography according to a selected style

Citation picker for VS Code (VS Codium)
- Citation Picker for Zotero

Better BibTeX
- Improve the compatibility with LaTeX, Markdown, and R
- Auto-generate citation keys
- Auto-export BibTeX files (one file per collection)

ZotFile
- Rename attachments (with a lot of naming rules)
- Move attachments (to a specific location)

Better Notes
- Enhance note editor (markdown)
- Knowledge analysis
- Note templates
- Export notes (PDF, DOCX, etc.)

DOI manager
- Look up DOI from CrossRef
- Check DOI validity
- Clean the DOI field

Official list

Zotero and RStudio

Requires a recent version of RStudio
Edit the R Markdown file using the Visual R Markdown editing mode
Easily insert in-text citations
Auto-generates a BibTeX file

YAML header:

---
title: "My document"
bibliography: references.bib
link-citations: true
---

More information

Reference access

Access from Zotero

CSV export

BibTeX export

SQL query

Zotero API

Export references from Zotero

Export references as CSV file

Export references as BibTeX file

Importing BibTeX into R

GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/rbibtools")

Usage:

# Import BibTeX files ----
refs <- rbibtools::read_bib(path = "folder_with_bib_files")

# Class ----
class(refs)
## [1] "data.frame"

Output:

Query Zotero SQL database

GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/zoteror")

Usage:

# Read the Zotero SQLite database ----
refs <- zoteror::get_zotero_data(path = "folder_with_zotero.sqlite")

# Class ----
class(refs)
## [1] "data.frame"

Output:

Remove duplicates

Detect duplicated references

Many available tools:

Zotero & Excel - Painful
R package revtools - No longer maintained
R package synthesisr
R package bibliometrix
and many more…

Let’s keep it simple:

The duplicated() function

The `duplicated()` function

Easy case

# Create dataset ----
refs <- data.frame(
  "title" = c("Climate change and biodiversity", 
              "Climate change and biodiversity",
              "Climate change and biodiversity",
              "Fisheries and management - Part I", 
              "Fisheries and management - Part II"))
refs

##                                title
## 1    Climate change and biodiversity
## 2    Climate change and biodiversity
## 3    Climate change and biodiversity
## 4  Fisheries and management - Part I
## 5 Fisheries and management - Part II

# Detect duplicates ----
dups <- duplicated(refs$"title")
dups

## [1] FALSE  TRUE  TRUE FALSE FALSE

# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs

##                                title duplicate
## 1    Climate change and biodiversity         0
## 2    Climate change and biodiversity         1
## 3    Climate change and biodiversity         1
## 4  Fisheries and management - Part I         0
## 5 Fisheries and management - Part II         0
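
Once duplicates are flagged, two follow-up operations are common: dropping the extra copies, and flagging every copy of a duplicated title, first occurrence included. The sketch below shows both in base R. The object names `unique_refs` and `all_copies` are illustrative.

```r
# Same dataset as above ----
refs <- data.frame(
  "title" = c("Climate change and biodiversity",
              "Climate change and biodiversity",
              "Climate change and biodiversity",
              "Fisheries and management - Part I",
              "Fisheries and management - Part II"))

# Keep only the first occurrence of each title ----
unique_refs <- refs[!duplicated(refs$"title"), , drop = FALSE]
nrow(unique_refs)

# Flag every copy, including the first occurrence ----
all_copies <- duplicated(refs$"title") |
  duplicated(refs$"title", fromLast = TRUE)
all_copies
```

By default `duplicated()` leaves the first occurrence unflagged. Combining it with `fromLast = TRUE` is handy when all copies need to be inspected side by side before deciding which one to keep.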

The `duplicated()` function

Real case

# Create dataset ----
refs <- data.frame(
  "title" = c("Climate change", 
              "CLIMATE CHANGE",
              "Climate- change."))

# Detect duplicates ----
dups <- duplicated(refs$"title")

# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs

##              title duplicate
## 1   Climate change         0
## 2   CLIMATE CHANGE         0
## 3 Climate- change.         0

Clean data

# Convert to lower case ----
(titles <- tolower(refs$"title"))

## [1] "climate change"   "climate change"   "climate- change."

# Remove punctuation ----
(titles <- gsub("[[:punct:]]", " ", titles))

## [1] "climate change"   "climate change"   "climate  change "

# Remove multi-whitespace ----
(titles <- gsub("\\s+", " ", titles))

## [1] "climate change"  "climate change"  "climate change "

# Remove leading and trailing whitespace ----
(titles <- trimws(titles))

## [1] "climate change" "climate change" "climate change"

# Detect duplicates ----
dups <- duplicated(titles)

# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs

##              title duplicate
## 1   Climate change         0
## 2   CLIMATE CHANGE         1
## 3 Climate- change.         1
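
The cleaning steps above can be wrapped in a small helper so that the same rules are applied to every corpus. This is a minimal sketch: `clean_title()` is a hypothetical name, not a function from an existing package.

```r
# Hypothetical helper wrapping the cleaning steps above ----
clean_title <- function(x) {
  x <- tolower(x)                    # Convert to lower case
  x <- gsub("[[:punct:]]", " ", x)   # Replace punctuation by a space
  x <- gsub("\\s+", " ", x)          # Collapse repeated whitespace
  trimws(x)                          # Remove leading/trailing whitespace
}

titles <- c("Climate change", "CLIMATE CHANGE", "Climate- change.")

# Detect duplicates on the cleaned titles ----
duplicated(clean_title(titles))
```

The original titles stay untouched: the cleaned version is only used as a comparison key.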

Extras

OpenAlex

An open-access bibliographic catalogue of scientific papers, authors and institutions.

Competes with commercial products such as Clarivate’s Web of Science or Elsevier’s Scopus
Provides bibliometric tools
Provides an API
Free account
The whole database can be downloaded

Website: https://openalex.org

OpenAlex

A lightweight interface

OpenAlex API R client

GitHub repo

Installation:

# Installation ----
install.packages("openalexR")

Usage:

# Be polite and tell who you are ----
options(openalexR.mailto = "anonymous@mail.com")

# DOI to search for ----
dois <- c("10.1371/journal.pbio.3001640", 
          "10.1038/s41597-023-02264-2")

# Retrieve document metadata ----
metadata <- openalexR::oa_fetch(entity = "works", doi = dois)

Output (with abstract):

Exercise

Download the following two .bib files
- refs-scopus.bib available here
- refs-webofscience.bib available here

Import the two .bib files into Zotero
- Create a collection for this exercise
- Import each .bib file into its own subcollection

Import references into R
- Install the package zoteror (GitHub: frbcesab/zoteror)
- Use the function get_zotero_data() to import Zotero references
- Select only references from the two collections

Detect duplicated references
- Use the function duplicated() on the DOI to identify duplicated references
- Add a column with 1 (duplicate) and 0 (no duplicate)

Export the final table
- Use the package writexl to export the table as an .xlsx file

Correction

Data acquisition

# Folder to save .bib files ----
path <- "~/Documents/Exercise"
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Download .bib files ----
repo_url   <- paste0("https://raw.githubusercontent.com/", 
                     "literaturesynthesis/", 
                     "corpus-management/main/data/")

filename_1 <- "refs-scopus.bib"

download.file(url      = paste0(repo_url, filename_1), 
              destfile = file.path(path, filename_1))

filename_2 <- "refs-webofscience.bib"

download.file(url      = paste0(repo_url, filename_2), 
              destfile = file.path(path, filename_2))

# Zotero step ----
## ...

Data cleaning

# Access references ----
refs <- zoteror::get_zotero_data(path = "~/zotero")

# Select collections ----
refs <- refs[refs$"collection" %in% c("WOS", "Scopus"), ]

# Detect duplicates (based on DOI) ----
# Missing DOIs are excluded (otherwise every NA after the first is flagged)
duplicated_doi <- duplicated(refs$"doi") & !is.na(refs$"doi")

# Store information (1 = duplicate, 0 = no duplicate) ----
refs$"duplicated" <- ifelse(duplicated_doi, 1, 0)

# Number of duplicates ----
sum(duplicated_doi)

# Export .xlsx file ----
writexl::write_xlsx(refs, file.path(path, "unique_references.xlsx"))
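
DOIs exported by different databases may differ in case or carry a URL prefix, so a DOI-based comparison can benefit from the same kind of cleaning as titles. The sketch below is one possible approach. `clean_doi()` is a hypothetical helper and the `dois` vector is made up for illustration.

```r
# Hypothetical helper: harmonize DOIs before comparing them ----
clean_doi <- function(x) {
  x <- trimws(tolower(x))                           # DOIs are case-insensitive
  x <- sub("^https?://(dx\\.)?doi\\.org/", "", x)   # Remove URL prefix
  x <- sub("^doi:\\s*", "", x)                      # Remove 'doi:' prefix
  x[!is.na(x) & x == ""] <- NA                      # Empty string means missing
  x
}

# Made-up example: same DOI written two ways, plus two missing values ----
dois <- c("10.1111/J.1365-2486.2009.02000.X",
          "https://doi.org/10.1111/j.1365-2486.2009.02000.x",
          NA,
          NA)

cleaned <- clean_doi(dois)

# Missing DOIs are never flagged as duplicates ----
duplicated(cleaned) & !is.na(cleaned)
```

References without a DOI are left unflagged here. They would need another key, such as a cleaned title, to be compared.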
