Constitution of the corpus

References acquisition, management & access

September 2024

Nicolas Casajus

Senior data scientist
@FRB-CESAB    

Table of contents







Reference acquisition



Reference management



Reference access





Remove duplicates



Extras



Exercise

  Reference acquisition

Two methods





Through a Web interface

Through an API

Web interface

WoS records

Web interface

WoS export format

File formats

Plain text (.txt)

FN Clarivate Analytics Web of Science
VR 1.0
PT J
AU Buisson, L
   Thuiller, W
   Casajus, N
   Lek, S
   Grenouillet, G
AF Buisson, Laetitia
   Thuiller, Wilfried
   Casajus, Nicolas
   Lek, Sovan
   Grenouillet, Gael
TI Uncertainty in ensemble forecasting of species distribution
SO GLOBAL CHANGE BIOLOGY
PY 2010
VL 16
IS 4
BP 1145
EP 1157
DI 10.1111/j.1365-2486.2009.02000.x
UT WOS:000274813800001
ER
EF

RIS format (.ris)

TY  - JOUR
AU  - Buisson, L
AU  - Thuiller, W
AU  - Casajus, N
AU  - Lek, S
AU  - Grenouillet, G
TI  - Uncertainty in ensemble forecasting of species distribution
T2  - GLOBAL CHANGE BIOLOGY
PY  - 2010
VL  - 16
IS  - 4
SP  - 1145
EP  - 1157
DO  - 10.1111/j.1365-2486.2009.02000.x
AN  - WOS:000274813800001
ER  -
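
Because RIS is a simple "tag - value" format, a record can be read with base R alone. The following is a minimal sketch, assuming a single record held in a character vector (a shortened copy of the example above); real files contain many records and need a more robust parser.

```r
# One RIS record, one line per element (shortened example) ----
ris <- c(
  "TY  - JOUR",
  "AU  - Buisson, L",
  "AU  - Thuiller, W",
  "TI  - Uncertainty in ensemble forecasting of species distribution",
  "PY  - 2010",
  "DO  - 10.1111/j.1365-2486.2009.02000.x",
  "ER  -"
)

# Keep only lines that carry a value ("TAG  - value") ----
ris <- ris[grepl("^[A-Z][A-Z0-9]  - .+", ris)]

# Split each line into tag and value ----
tags   <- substr(ris, 1, 2)
values <- sub("^[A-Z][A-Z0-9]  - ", "", ris)

# Group values by tag (repeated tags such as AU become vectors) ----
record <- split(values, tags)

record$"AU"
record$"PY"
```

Repeated tags (here AU) are kept together in one vector, which is why the RIS format lists one author per line.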

File formats

BibTeX format (.bib)

@article{buisson-2010-gcb,
  author  = {Buisson, Laetitia and Thuiller, Wilfried and Casajus, Nicolas and 
             Lek, Sovan and Grenouillet, Ga{\"{e}}l},
  title   = {Uncertainty in ensemble forecasting of species distribution},
  journal = {Global Change Biology},
  year    = {2010},
  volume  = {16},
  number  = {4},
  pages   = {1145--1157},
  doi     = {10.1111/j.1365-2486.2009.02000.x}
}

Structure

% Entry type
@article{ ... }

% Citekey
buisson-2010-gcb

% Key-value pairs
journal = {Global Change Biology}

% or
journal = "Global Change Biology"
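
The three parts of an entry (entry type, citekey, key-value pairs) can be pulled apart with a few regular expressions. This is only an illustrative sketch, assuming one field per line and no nested braces; it is not a general BibTeX parser.

```r
# A simplified BibTeX entry, one field per line ----
entry <- c(
  "@article{buisson-2010-gcb,",
  "  title   = {Uncertainty in ensemble forecasting of species distribution},",
  "  journal = {Global Change Biology},",
  "  year    = {2010}",
  "}"
)

# Entry type and citekey (both on the first line) ----
entry_type <- sub("^@([[:alpha:]]+)\\{.*$", "\\1", entry[1])
citekey    <- sub("^@[[:alpha:]]+\\{([^,]+),.*$", "\\1", entry[1])

# Key-value pairs (lines containing an equals sign) ----
fields <- grep("=", entry, value = TRUE)
keys   <- trimws(sub("=.*$", "", fields))
values <- trimws(sub("^[^=]*=", "", fields))

# Drop the surrounding braces or quotes and the trailing comma ----
values <- gsub("^[{\"]|[}\"],?$", "", values)

setNames(values, keys)
```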


Advantages:

  • Easy to read and to understand
  • Good support for accents and case
  • Recognized by R, Rmarkdown, Quarto, LaTeX, Zotero/Mendeley, etc.


 Suggested reading: Quick BibTeX Guide

Application Programming Interface (API)

Description/Advantages

  • Command line interface
  • Good for automation & reproducibility
  • Structured raw data (JSON, XML, etc.)
  • Various clients available (in R, Python, etc.)
  • Available for Web of Science, Scopus, OpenAlex
  • No official API for Google Scholar (web scraping is the only option)

Limitations

  • Requires a token (authentication)
  • Sometimes not free (WoS)
  • Limited number of records per request
  • Limited number of requests per month
  • Incomplete data (e.g. no abstract)
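
Because the number of records per request is capped, a large result set has to be fetched page by page. A minimal sketch of the arithmetic, assuming a hypothetical total of 1234 records and a hypothetical limit of 50 records per request (neither value comes from a specific API):

```r
# Hypothetical values (check the documentation of the API you use) ----
n_records <- 1234
per_page  <- 50

# Number of requests needed to retrieve all records ----
n_requests <- ceiling(n_records / per_page)

# First and last record returned by each request ----
pages <- seq_len(n_requests)
first <- (pages - 1) * per_page + 1
last  <- pmin(pages * per_page, n_records)

n_requests
tail(data.frame(page = pages, first = first, last = last), 2)
```

Comparing the number of requests with the monthly quota before starting tells you whether the whole query fits in one billing period.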

WoS Starter API

 You need to request an API token
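
A common way to handle such a token is to store it as an environment variable rather than writing it in scripts. The sketch below uses a hypothetical variable name (MY_API_TOKEN) and a placeholder value; the name actually expected by a given client may differ, so check its documentation.

```r
# Hypothetical variable name and placeholder value. In practice, add a line
# such as MY_API_TOKEN=xxxxxxxx to your ~/.Renviron file and restart R
Sys.setenv(MY_API_TOKEN = "xxxxxxxx")

# Read the token at run time instead of hard-coding it in scripts ----
token <- Sys.getenv("MY_API_TOKEN")

# Fail early if the token is missing ----
if (!nzchar(token)) {
  stop("No API token found: set MY_API_TOKEN in ~/.Renviron")
}
```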

Metadata available:

Document identifier
Document type
Title
Authors
Published year
Published month
Source
Volume
Issue
Pages
Article number
Book editors
Keywords
DOI
ISSN
ISBN
PMID
Times cited

WoS Starter API client


 GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/rwosstarter")

Usage:

# Search equation ----
query <- 'AU="Casajus N"'

# Get the total number of records ----
rwosstarter::wos_search(query, database = "WOS")
## [1] 28

# Download records metadata ----
refs <- rwosstarter::wos_get_records(query, database = "WOS")

Output (no abstract):

  Reference management

Reference management software

Allows you to:

  • store, organize, and annotate references
  • retrieve metadata & full text (connection to databases)
  • easily add new references (plugins for Web browsers)
  • insert in-text citations (plugins for Word, Writer, RStudio)
  • generate a bibliography
  • share libraries and collaborate (online account)

Software comparison



                 EndNote                   Mendeley                  Zotero
OS
License          Proprietary (Clarivate)   Proprietary (Elsevier)    Open source (AGPL)
Pricing          > 200 $                   Free                      Free
Online storage   2 GB (+ pricing options)  2 GB (+ pricing options)  300 MB (+ pricing options)
Citation styles  > 7,000                   > 7,000                   > 10,000
Google Docs
LaTeX
Customization

Zotero

Zotero plugins

  • Zotero connector  

    • Available for the most popular Web browsers
    • Add items to your Zotero library in one click
    • Add PDF (if open access)


  • Word processor plugin  

    • Available for Word, Writer and Google Docs
    • Insert in-text citations
    • Generate bibliography according to a selected style


  • Citation picker for VS Code (VS Codium)

    • Citation Picker for Zotero  

Zotero plugins

  • Better BibTeX  

    • Improve the compatibility with LaTeX, Markdown, and R
    • Auto-generate citation keys
    • Auto-export BibTeX files (one file per collection)


  • ZotFile  

    • Rename attachments (with a lot of naming rules)
    • Move attachments (to a specific location)


  • Better Notes  

    • Enhance note editor (markdown)
    • Knowledge analysis
    • Note templates
    • Export notes (PDF, DOCX, etc.)


  • DOI manager  

    • Look up DOI from CrossRef
    • Check DOI validity
    • Clean the DOI field


 Official list

Zotero and RStudio


  • Requires a recent version of RStudio
  • Edit R Markdown files using the Visual R Markdown editing mode
  • Easily insert in-text citations
  • Auto-generate a BibTeX file

YAML header:

---
title: "My document"
bibliography: references.bib
link-citations: true
---


 More information

  Reference access

Access from Zotero

  • CSV export
  • BibTeX export
  • SQL query
  • Zotero API

Export references from Zotero

Export references as CSV file

Export references as BibTeX file

Importing BibTeX into R


 GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/rbibtools")

Usage:

# Import BibTeX files ----
refs <- rbibtools::read_bib(path = "folder_with_bib_files")

# Class ----
class(refs)
## [1] "data.frame"

Output:

Query Zotero SQL database


 GitHub repo

Installation:

# Installation ----
install.packages("remotes")
remotes::install_github("frbcesab/zoteror")

Usage:

# Import Zotero references ----
refs <- zoteror::get_zotero_data(path = "folder_with_zotero.sqlite")

# Class ----
class(refs)
## [1] "data.frame"

Output:

  Remove duplicates

Detect duplicated references

Many available tools:


Let’s keep it simple:

 The duplicated() function

The duplicated() function

Easy case

# Create dataset ----
refs <- data.frame(
  "title" = c("Climate change and biodiversity", 
              "Climate change and biodiversity",
              "Climate change and biodiversity",
              "Fisheries and management - Part I", 
              "Fisheries and management - Part II"))
refs
##                                title
## 1    Climate change and biodiversity
## 2    Climate change and biodiversity
## 3    Climate change and biodiversity
## 4  Fisheries and management - Part I
## 5 Fisheries and management - Part II


# Detect duplicates ----
dups <- duplicated(refs$"title")
dups
## [1] FALSE  TRUE  TRUE FALSE FALSE


# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs
##                                title duplicate
## 1    Climate change and biodiversity         0
## 2    Climate change and biodiversity         1
## 3    Climate change and biodiversity         1
## 4  Fisheries and management - Part I         0
## 5 Fisheries and management - Part II         0

The duplicated() function

Real case

# Create dataset ----
refs <- data.frame(
  "title" = c("Climate change", 
              "CLIMATE CHANGE",
              "Climate- change."))


# Detect duplicates ----
dups <- duplicated(refs$"title")

# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs
##              title duplicate
## 1   Climate change         0
## 2   CLIMATE CHANGE         0
## 3 Climate- change.         0

Clean data

# Convert to lower case ----
(titles <- tolower(refs$"title"))
## [1] "climate change"   "climate change"   "climate- change."
# Remove punctuation ----
(titles <- gsub("[[:punct:]]", " ", titles))
## [1] "climate change"   "climate change"   "climate  change "
# Remove multi-whitespace ----
(titles <- gsub("\\s+", " ", titles))
## [1] "climate change"  "climate change"  "climate change "
# Remove leading and trailing whitespace ----
(titles <- trimws(titles))
## [1] "climate change" "climate change" "climate change"


# Detect duplicates ----
dups <- duplicated(titles)

# Append results ----
refs$"duplicate" <- as.numeric(dups)
refs
##              title duplicate
## 1   Climate change         0
## 2   CLIMATE CHANGE         1
## 3 Climate- change.         1

  Extras

OpenAlex

A bibliographic catalogue of scientific papers, authors and institutions, freely accessible in open access.


  • Competes with commercial products such as Clarivate’s Web of Science or Elsevier’s Scopus.
  • Provides bibliometric tools
  • Provides an API
  • Free account
  • Download the whole database


Website: https://openalex.org

OpenAlex

A lightweight interface


OpenAlex API R client


 GitHub repo

Installation:

# Installation ----
install.packages("openalexR")

Usage:

# Be polite and tell who you are ----
options(openalexR.mailto = "anonymous@mail.com")

# DOI to search for ----
dois <- c("10.1371/journal.pbio.3001640", 
          "10.1038/s41597-023-02264-2")

# Retrieve document metadata ----
metadata <- openalexR::oa_fetch(entity = "works", doi = dois)


Output (with abstract):

  Exercise

Exercise

 Download the two following .bib files:

  • refs-scopus.bib available here
  • refs-webofscience.bib available here

 Import the two .bib files in Zotero

  • Create a collection for this exercise
  • Import each .bib file in its own subcollection

 Import references in R

  • Install the package zoteror available here
  • Use the function get_zotero_data() to import Zotero references
  • Select only references from the two collections

 Detect duplicated references

  • Use the function duplicated() on the DOI to identify duplicated references
  • Add a column with 1 (duplicate) and 0 (no duplicate)

 Export the final table

  • Use the package writexl to export the table as an .xlsx file

Correction

Data acquisition

# Folder to save .bib files ----
path <- "~/Documents/Exercise"
dir.create(path, recursive = TRUE, showWarnings = FALSE)

# Download .bib files ----
repo_url   <- paste0("https://raw.githubusercontent.com/", 
                     "literaturesynthesis/", 
                     "corpus-management/main/data/")

filename_1 <- "refs-scopus.bib"

download.file(url      = paste0(repo_url, filename_1), 
              destfile = file.path(path, filename_1))

filename_2 <- "refs-webofscience.bib"

download.file(url      = paste0(repo_url, filename_2), 
              destfile = file.path(path, filename_2))

# Zotero step ----
## ...

Data cleaning

# Access references ----
refs <- zoteror::get_zotero_data(path = "~/zotero")

# Select collections ----
refs <- refs[refs$"collection" %in% c("WOS", "Scopus"), ]

# Detect duplicates (based on DOI, ignoring case and missing DOIs) ----
doi <- tolower(refs$"doi")
duplicated_doi <- duplicated(doi) & !is.na(doi)

# Store information ----
refs$"duplicated" <- ifelse(duplicated_doi, 1, 0)

# Number of duplicates ----
sum(duplicated_doi)

# Export .xlsx file ----
writexl::write_xlsx(refs, file.path(path, "unique_references.xlsx"))
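
DOIs exported by different databases are not always written the same way (upper or lower case, with or without a URL prefix), and some references have no DOI at all. A hedged sketch of a normalization step applied before duplicated(); the helper name clean_doi() is hypothetical and the toy vector only reuses the DOI shown earlier:

```r
# Hypothetical helper to harmonize DOIs before comparison ----
clean_doi <- function(x) {
  x <- tolower(trimws(x))                          # DOIs are case-insensitive
  x <- sub("^https?://(dx\\.)?doi\\.org/", "", x)  # Drop the URL prefix
  x[!is.na(x) & x == ""] <- NA                     # Empty strings become NA
  x
}

# Toy example: same DOI written twice and two missing DOIs ----
dois <- c("10.1111/j.1365-2486.2009.02000.x",
          "https://doi.org/10.1111/J.1365-2486.2009.02000.X",
          NA, NA)

cleaned <- clean_doi(dois)

# Missing DOIs must not be flagged as duplicates of each other ----
dups <- duplicated(cleaned) & !is.na(cleaned)
dups
## [1] FALSE  TRUE FALSE FALSE
```

References without a DOI still need a second pass, for instance on cleaned titles.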