Title: | Document Conversion to 'PDF' or 'PNG' |
---|---|
Description: | Generates images from documents of different types. Three main features are provided: functions for generating document thumbnails, functions for performing visual tests of documents, and a function for updating the fields and table of contents of a 'Microsoft Word' or 'RTF' document. In order to work, 'LibreOffice' and/or 'Microsoft Word' must be installed on the machine. If 'Microsoft Word' is available, it can be used to produce PDF documents or images identical to the originals; otherwise, 'LibreOffice' is used and the rendering can sometimes differ from the original documents. |
Authors: | David Gohel [aut, cre], ArData [cph], David Hajage [ctb] (initial powershell code) |
Maintainer: | David Gohel <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.3.0001 |
Built: | 2025-01-04 05:41:55 UTC |
Source: | https://github.com/ardata-fr/doconv |
Test if 'LibreOffice' can export to PDF. An attempt to export to PDF is made to confirm that the PDF export is functional.
check_libreoffice_export(UserInstallation = NULL)
UserInstallation |
use this value to set a non-default user profile path for "LibreOffice". If not provided, a temporary directory is created. This makes it possible to run more than a single session of "LibreOffice". |
a single logical value.
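As a minimal sketch of the UserInstallation argument, the snippet below passes a dedicated profile directory to the check. It assumes the argument accepts a plain directory path and that creating the directory beforehand is harmless; the directory name `lo-profile-worker-1` is hypothetical.

```r
# Hypothetical per-worker profile directory (the name is an illustration only).
profile_dir <- file.path(tempdir(), "lo-profile-worker-1")
dir.create(profile_dir, showWarnings = FALSE, recursive = TRUE)

# Only attempt the export check when the packages and 'LibreOffice' are present.
if (requireNamespace("doconv", quietly = TRUE) &&
    requireNamespace("locatexec", quietly = TRUE) &&
    locatexec::exec_available("libreoffice")) {
  # Assumption: UserInstallation takes a directory path.
  ok <- doconv::check_libreoffice_export(UserInstallation = profile_dir)
  message("LibreOffice PDF export functional: ", ok)
}
```

Giving each parallel worker its own profile directory is one way to benefit from the multi-session behaviour described for this argument.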
library(locatexec)
if (exec_available("libreoffice")) {
  check_libreoffice_export()
}
Update all fields and table of contents of a Word document using "Microsoft Word". This function will not work if "Microsoft Word" is not available on your machine.
The calls to "Microsoft Word" are made differently depending on the operating system. On "Windows", a "PowerShell" script using COM technology is used to control "Microsoft Word". On macOS, an "AppleScript" script is used to control "Microsoft Word".
docx_update(input)
input |
file input |
the name of the produced pdf (the same value as output)
library(locatexec)
if (exec_available('word')) {
  file <- system.file(package = "doconv", "doc-examples/example.docx")
  docx_out <- tempfile(fileext = ".docx")
  file.copy(file, docx_out)
  docx_update(input = docx_out)
}
Convert docx to pdf directly using "Microsoft Word". This function will not work if "Microsoft Word" is not available on your machine.
The calls to "Microsoft Word" are made differently depending on the operating system:
On "Windows", a "PowerShell" script using COM technology is used to control "Microsoft Word". The resulting PDF contains a browsable TOC.
On macOS, an "AppleScript" script is used to control "Microsoft Word". Unlike on 'Windows', the resulting PDF does not contain a browsable TOC.
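The platform-dependent behaviour described above can be pictured with a small helper. This is only an illustrative sketch, not the package's internal code; the function name `office_backend` and its return strings are hypothetical.

```r
# Hypothetical helper: report which scripting technology would drive
# 'Microsoft Word' on a given operating system, per the description above.
office_backend <- function(sysname = Sys.info()[["sysname"]]) {
  switch(sysname,
    Windows = "PowerShell script (COM)",
    Darwin  = "AppleScript script",
    NA_character_  # other systems: 'Microsoft Word' is not scripted
  )
}

office_backend("Windows")
office_backend("Darwin")
office_backend("Linux")
```

A check like this can be useful in scripts that need to decide up front whether a browsable TOC can be expected in the output.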
docx2pdf(input, output = gsub("\\.(docx|doc|rtf)$", ".pdf", input))
input, output |
file input and optional file output (defaults to input with the pdf extension). |
the name of the produced pdf (the same value as output)
If your execution policy is set to "RemoteSigned", 'doconv' will not be able to run its PowerShell script. Setting it to "Unrestricted" should make it work. If you are in a managed and administered environment, you may not be able to use 'doconv' because of execution policies.
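To see which policy is currently in effect, the PowerShell cmdlet `Get-ExecutionPolicy` can be queried from R. The helper below is a sketch under the assumption that `powershell` is on the PATH; the function name `get_execution_policy` is hypothetical, and on non-Windows systems it simply returns `NA`.

```r
# Hypothetical helper: read the PowerShell execution policy (Windows only).
get_execution_policy <- function() {
  if (.Platform$OS.type != "windows") {
    return(NA_character_)
  }
  out <- tryCatch(
    suppressWarnings(
      system2("powershell",
              c("-NoProfile", "-Command", "Get-ExecutionPolicy"),
              stdout = TRUE, stderr = FALSE)
    ),
    error = function(e) NA_character_
  )
  # Keep the first line of output only; NA if nothing was returned.
  trimws(out[1])
}

policy <- get_execution_policy()
if (!is.na(policy)) {
  message("Current PowerShell execution policy: ", policy)
}
```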
On macOS the call happens in a working directory managed with the function working_directory().
Manual intervention is necessary to authorize the 'Word' and 'PowerPoint' applications to write to a single directory: the working directory. These permissions must be granted manually, as required by the macOS security policy. We think that this is not a problem because it is unlikely that you will use a Mac machine as a server.
You must click "allow" two times to:
allow R to run 'AppleScript' scripts that will control Word
allow Word to write to the working directory.
This process is a one-time operation.
library(locatexec)
if (exec_available('word')) {
  file <- system.file(package = "doconv", "doc-examples/example.docx")
  out <- docx2pdf(input = file, output = tempfile(fileext = ".pdf"))
  if (file.exists(out)) {
    message(basename(out), " is existing now.")
  }
}
This expectation can be used with 'tinytest' and 'testthat' to check if a current document of type pdf, docx, doc, rtf, pptx or png matches a target document. When the expectation is checked for the first time, the expectation fails and a target miniature of the document is saved in a folder named _tinytest_doconv or _snaps.
expect_snapshot_doc( name, x, tolerance = 0.001, engine = c("tinytest", "testthat") )
name |
a string to identify the test. Each document in the test suite must have a unique name. |
x |
file path of a document |
tolerance |
the ratio of different pixels that is acceptable before triggering a failure. |
engine |
test package being used in the test suite, one of "tinytest" or "testthat". |
A tinytest::tinytest() or a testthat::expect_snapshot_file object.
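To make the tolerance argument concrete, here is a toy illustration of a "ratio of different pixels" computed on two small integer matrices. It is a sketch of the idea only, not the comparison code used by the package; the helper `diff_ratio` is hypothetical, and the assumption that a ratio at or below the tolerance passes is mine.

```r
# Hypothetical helper: share of pixels that differ between two images
# represented as equally sized matrices.
diff_ratio <- function(current, target) {
  stopifnot(identical(dim(current), dim(target)))
  mean(current != target)
}

target <- matrix(0L, nrow = 100, ncol = 100)  # 10,000 "pixels"
current <- target
current[1:5, 1] <- 255L                       # change 5 of them

ratio <- diff_ratio(current, target)          # 5 / 10000 = 5e-04
ratio <= 0.001                                # within the default tolerance
```

With the default tolerance of 0.001, a 100 x 100 image could therefore differ in up to 10 pixels under this reading.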
file <- system.file(package = "doconv", "doc-examples/example.docx")
## Not run:
if (require("tinytest") && msoffice_available()){
  # first run add a new snapshot
  expect_snapshot_doc(x = file, name = "docx file", engine = "tinytest")
  # next runs compare with the snapshot
  expect_snapshot_doc(x = file, name = "docx file", engine = "tinytest")
  # cleaning directory
  unlink("_tinytest_doconv", recursive = TRUE, force = TRUE)
}
if (require("testthat") && msoffice_available()){
  local_edition(3)
  # first run add a new snapshot
  expect_snapshot_doc(x = file, name = "docx file", engine = "testthat")
  # next runs compare with the snapshot
  expect_snapshot_doc(x = file, name = "docx file", engine = "testthat")
}
## End(Not run)
This expectation can be used with 'tinytest' and 'testthat' to check if a current document of type HTML matches a target document. When the expectation is checked for the first time, the expectation fails and a target miniature of the document is saved in a folder named _tinytest_doconv or _snaps.
expect_snapshot_html( name, x, tolerance = 0.001, engine = c("tinytest", "testthat"), ... )
name |
a string to identify the test. Each document in the test suite must have a unique name. |
x |
file path of an HTML document |
tolerance |
the ratio of different pixels that is acceptable before triggering a failure. |
engine |
test package being used in the test suite, one of "tinytest" or "testthat". |
... |
arguments used by |
A tinytest::tinytest() or a testthat::expect_snapshot_file object.
file <- tempfile(fileext = ".html")
html <- paste0("<html><head><title>hello</title></head>",
  "<body><h1>Hello World</h1></body></html>\n")
cat(html, file = file)
## Not run:
if (require("tinytest") && require("webshot2")){
  # first run add a new snapshot
  expect_snapshot_html(x = file, name = "html file", engine = "tinytest")
  # next runs compare with the snapshot
  expect_snapshot_html(x = file, name = "html file", engine = "tinytest")
  # cleaning directory
  unlink("_tinytest_doconv", recursive = TRUE, force = TRUE)
}
if (require("testthat") && require("webshot2")){
  local_edition(3)
  # first run add a new snapshot
  expect_snapshot_html(x = file, name = "html file", engine = "testthat")
  # next runs compare with the snapshot
  expect_snapshot_html(x = file, name = "html file", engine = "testthat")
}
## End(Not run)
Tests whether 'Microsoft Office' is available.
msoffice_available()
a single logical value.
msoffice_available()
Convert pptx to pdf directly using "Microsoft PowerPoint". This function will not work if "Microsoft PowerPoint" is not available on your machine.
The calls to "Microsoft PowerPoint" are made differently depending on the operating system. On "Windows", a "PowerShell" script using COM technology is used to control "Microsoft PowerPoint". On macOS, an "AppleScript" script is used to control "Microsoft PowerPoint".
pptx2pdf(input, output = gsub("\\.pptx$", ".pdf", input))
input, output |
file input and optional file output (defaults to input with the pdf extension). |
the name of the produced pdf (the same value as output)
On macOS the call happens in a working directory managed with the function working_directory().
Manual intervention is necessary to authorize the 'PowerPoint' application to write to a single directory: the working directory. These permissions must be granted manually, as required by the macOS security policy. We think that this is not a problem because it is unlikely that you will use a Mac machine as a server.
You must also click "allow" two times to:
allow R to run 'AppleScript' scripts that will control PowerPoint
allow PowerPoint to write to the working directory.
This process is a one-time operation.
library(locatexec)
if (exec_available('powerpoint')) {
  file <- system.file(package = "doconv", "doc-examples/example.pptx")
  out <- pptx2pdf(input = file, output = tempfile(fileext = ".pdf"))
  if (file.exists(out)) {
    message(basename(out), " is existing now.")
  }
}
Convert a file into an image (magick image) in which the pages are arranged in rows; each row can contain one or several pages.
The result can be saved as a png file.
to_miniature( filename, row = NULL, width = NULL, border_color = "#ccc", border_geometry = "2x2", dpi = 150, fileout = NULL, timeout = 120, ... )
filename |
input filename, supported documents are 'Microsoft Word', 'Microsoft PowerPoint', 'RTF' and 'PDF' documents. |
row |
row index for every page. Use 0 to drop a page from the final miniature. |
width |
width of a single image, recommended values are: |
border_color |
border color, see |
border_geometry |
border geometry to be added around images, see |
dpi |
resolution (dots per inch) to use for images, see |
fileout |
if not NULL, result is saved in a png file whose filename is defined by this argument. |
timeout |
maximum time in seconds that 'LibreOffice' is allowed to take to generate the corresponding pdf file, ignored if 0. |
... |
arguments used by webshot2 when the input is an HTML document. |
a magick image object as returned by image_read().
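The row argument takes one index per page. As a sketch of how such a vector might be built, the hypothetical helper `rows_for_pages` below places a fixed number of pages on each row and sets dropped pages to 0; the file name in the commented call is also hypothetical.

```r
# Hypothetical helper: one row index per page, `per_row` pages on each row,
# with the pages listed in `drop` set to 0 so they are left out.
rows_for_pages <- function(n_pages, per_row, drop = integer()) {
  row <- (seq_len(n_pages) - 1L) %/% per_row + 1L
  row[drop] <- 0L
  row
}

# Six pages, three per row, last page dropped.
row_index <- rows_for_pages(6, per_row = 3, drop = 6)
row_index
# 1 1 1 2 2 0

# Possible use with a six-page document (hypothetical file name):
# doconv::to_miniature("six-pages.pdf", row = row_index)
```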
library(locatexec)
docx_file <- system.file(
  package = "doconv",
  "doc-examples/example.docx"
)
if (exec_available("word"))
  to_miniature(docx_file)

pptx_file <- system.file(
  package = "doconv",
  "doc-examples/example.pptx"
)
if (exec_available("libreoffice") && check_libreoffice_export())
  to_miniature(pptx_file)
Convert documents to pdf with a script using 'Office' or 'LibreOffice'.
If 'Microsoft Word' and 'Microsoft PowerPoint' are available, 'docx', 'doc', 'rtf' and 'pptx' files are converted to PDF with 'Office' via a script.
If 'Microsoft Word' and 'Microsoft PowerPoint' are not available (on Linux for example), 'LibreOffice' is used to convert documents. In that case the rendering can differ from the original document. Conversion of 'Microsoft PowerPoint' files to PDF is very well supported. 'Microsoft Word' documents can also be converted, but some Word features, such as sections, are not supported.
to_pdf( input, output = gsub("\\.[[:alnum:]]+$", ".pdf", input), timeout = 120, UserInstallation = NULL )
input, output |
file input and optional file output. If the output file is not provided, its value will be the input file name with the extension 'pdf'. |
timeout |
timeout in seconds, ignored if 0. |
UserInstallation |
use this value to set a non-default user profile path for 'LibreOffice'. If not provided, a temporary directory is created. This makes it possible to run more than a single session of 'LibreOffice'. |
the name of the produced pdf (the same value as output), invisibly.
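The default value of output comes from the `gsub()` expression shown in the usage. The base R snippet below applies that same pattern to a few made-up file names to show what it produces; the helper name `default_output` is hypothetical.

```r
# Same expression as the default of `output` in the usage of to_pdf().
default_output <- function(input) {
  gsub("\\.[[:alnum:]]+$", ".pdf", input)
}

default_output("slides/deck.pptx")  # "slides/deck.pdf"
default_output("archive.tar.gz")    # "archive.tar.pdf" (only the last extension)
default_output("notes")             # "notes" (no extension, left unchanged)
```

Because an input without an extension is returned unchanged, it may be safer to pass output explicitly in that case so that the input file is not used as the output path.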
On some Ubuntu platforms, 'LibreOffice' requires the path /usr/lib/libreoffice/program to be added to the environment variable LD_LIBRARY_PATH (you should see the message "libreglo.so cannot open shared object file" if this is the case). This can be done with the R command Sys.setenv(LD_LIBRARY_PATH = "/usr/lib/libreoffice/program/").
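Calling `Sys.setenv()` directly replaces any value already stored in LD_LIBRARY_PATH. A possible variant, sketched below with a hypothetical helper named `add_ld_library_path`, appends the 'LibreOffice' path only when it is missing and keeps existing entries.

```r
# Hypothetical helper: append a directory to LD_LIBRARY_PATH without
# discarding entries that are already set.
add_ld_library_path <- function(path = "/usr/lib/libreoffice/program") {
  current <- Sys.getenv("LD_LIBRARY_PATH", unset = "")
  parts <- if (nzchar(current)) {
    strsplit(current, ":", fixed = TRUE)[[1]]
  } else {
    character()
  }
  if (!path %in% parts) {
    Sys.setenv(LD_LIBRARY_PATH = paste(c(parts, path), collapse = ":"))
  }
  invisible(Sys.getenv("LD_LIBRARY_PATH"))
}

# Only relevant on Linux, where the message described above may appear.
if (Sys.info()[["sysname"]] == "Linux") {
  add_ld_library_path()
}
```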
library(locatexec)
if (exec_available("libreoffice") && check_libreoffice_export()) {
  out_pptx <- tempfile(fileext = ".pdf")
  file <- system.file(package = "doconv", "doc-examples/example.pptx")
  to_pdf(input = file, output = out_pptx)

  out_docx <- tempfile(fileext = ".pdf")
  file <- system.file(package = "doconv", "doc-examples/example.docx")
  to_pdf(input = file, output = out_docx)
}
Initialize or remove the working directory used when docx2pdf creates the PDF.
On 'macOS', the operation requires the Word or PowerPoint program to have write access to the directory. If that authorization does not exist, a manual confirmation window is launched, which prevents automation.
Fortunately, users only have to do this once. The package implementation uses a single directory where results are saved, so that the confirmation has to be clicked only once.
This directory is managed by the R function R_user_dir(). Its value can be read with the working_directory() function. The directory can be deleted with rm_working_directory() and created with init_working_directory().
Each call will remove that directory when completed.
As a user, you do not have to use these functions because they are called automatically by the docx2pdf() function. They are provided to meet the requirements of CRAN policy:
"[...] packages may store user-specific data, configuration and cache files in their respective user directories [...], provided that by default sizes are kept as small as possible and the contents are actively managed (including removing outdated material)."
working_directory()
rm_working_directory()
init_working_directory()
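For orientation, the sketch below lists the user directories that `tools::R_user_dir()` (available since R 4.0.0) would return for a package named "doconv". Which of the three kinds the package actually uses is not stated here, so all are shown; the call to `working_directory()` is guarded so that the snippet also runs when 'doconv' is not installed.

```r
# Candidate per-user directories for a package called "doconv".
kinds <- c("data", "config", "cache")
candidates <- vapply(
  kinds,
  function(kind) tools::R_user_dir("doconv", which = kind),
  character(1)
)
print(candidates)

# Read the directory actually managed by the package, if it is installed.
if (requireNamespace("doconv", quietly = TRUE)) {
  wd <- doconv::working_directory()
  message("doconv working directory: ", wd,
          " (exists: ", dir.exists(wd), ")")
}
```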