Using R to convert word doc to CSV in Managed Folder

Davidknox · February 2021

Hi,

Has anyone successfully converted a Word document to a CSV file using an R recipe and managed folders?

The aim is to read the document, tokenise it, and export the result to CSV for use elsewhere.

A version of this code worked for tokenising a CSV document in a managed folder, but this version reports that no file exists.

## packages I'm using

library(dataiku)
library(quanteda)
library(tidyr)
library(readtext)

# Recipe inputs
word <- dkuManagedFolderPath("rFwE9zCu")

file <- dkuManagedFolderPartitionPaths("Word") # confirms there is a file called /endyear.docx
filepath <- dkuManagedFolderPath("Word") # confirms there is a folder path /mnt/dataiku/dss_data/managed_datasets/GETTINGPDFSTOWORK/rFwE9zCu

## load data
dataload <- dkuManagedFolderDownloadPath("Word","/endyear.docx", as="text") ## raw doesn't work

dat_word<-readtext(dataload)

## create corpus

corpus_use<-corpus(dat_word, docvars = data.frame(party = names(dat_word)))

## tokenise

token_use<-tokens(corpus_use)

## create dtm

dtmatrix<-dfm(token_use)

## convert into dataframe
word_docs<-as.data.frame(dtmatrix) ## to transpose columns as rows

# Recipe outputs
dkuWriteDataset(word_docs,"Word_docs")

Any ideas what is wrong with the code?

Sarina · February 2021

Hi @Davidknox,

Your code largely works for me, so I think I might be missing where you are running into an issue. If the below does not help, can you attach a screenshot of the error you are receiving? It might also help to see what your input data looks like and what you expect your output data to look like.

The only adjustment that I made was that I changed the readtext function to point to the full word document file path, like this:

file_path<-gsub(" ", "", paste(folder_path, file_path))
dat_word<-readtext(file = file_path)

So the readtext line is now more like readtext(file='/long/path/to/my/worddoc'). Here's the full code I tried, which gives me a tokenized output dataset:

library(dataiku)
library(tidyr)
library(quanteda)
library(readtext)

# Recipe inputs
folder_path <- dkuManagedFolderPath("word documents") # gives me my full path to the folder
file_path <- dkuManagedFolderPartitionPaths("word documents")   # gives the specific file document

## load data
dataload <- dkuManagedFolderDownloadPath("word documents",file_path, as="raw") ## as="raw" ran for me; the result is not used below

dkuManagedFolderPathDetails("word documents", file_path)
file_path<-gsub(" ", "", paste(folder_path, file_path))
dat_word<-readtext(file = file_path)
dat_word

# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
## create corpus
corpus_use<-corpus(dat_word, docvars = data.frame(party = names(dat_word)))
## tokenise
token_use<-tokens(corpus_use)
## create dtm
dtmatrix<-dfm(token_use)
## convert into dataframe
word_docs<-as.data.frame(dtmatrix)  ## to transpose columns as rows
# Recipe outputs


# Compute recipe outputs
# TODO: Write here your actual code that computes the outputs
tokenizedR <- word_docs # Compute a data frame for the output to write into tokenizedR


# Recipe outputs
dkuWriteDataset(tokenizedR,"tokenizedR")
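
A note on the path-building line in the code above: gsub(" ", "", paste(...)) removes every space, including any inside a file name. Below is a minimal sketch of an alternative using paste0(), which joins without a separator so nothing needs stripping afterwards. The folder path and file name are made-up placeholders, not values from a real Dataiku instance.

```r
# Hypothetical values standing in for the results of
# dkuManagedFolderPath() and dkuManagedFolderPartitionPaths().
folder_path <- "/data/managed_folders/abc123"
file_name   <- "/end year report.docx"

# Approach from the post: paste() inserts a space, gsub() then strips ALL spaces.
stripped <- gsub(" ", "", paste(folder_path, file_name))

# Alternative: paste0() joins with no separator, so the file name stays intact.
joined <- paste0(folder_path, file_name)

print(stripped)
print(joined)
```

With a file name that contains no spaces, both approaches produce the same path, so this only matters once such a file is dropped into the folder.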

I also note that you made the comment that the as="raw" argument did not work for you. It seemed to work for me, though I didn't end up using the results. If this is where you are running into an issue, can you give a little more detail and attach a screenshot of the error as well?
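
If as="raw" does return the file's bytes, one way to use them is to write them to a temporary file and pass that path to readtext(), since readtext() expects a file path or URL rather than file contents. Here is a minimal base R sketch of that step. write_raw_to_temp is a hypothetical helper, and the bytes are a stand-in created with charToRaw() rather than a real download.

```r
# Hypothetical helper: save raw bytes to a temp file and return its path.
write_raw_to_temp <- function(raw_bytes, ext = ".docx") {
  stopifnot(is.raw(raw_bytes))
  tmp <- tempfile(fileext = ext)
  writeBin(raw_bytes, tmp)
  tmp
}

# Stand-in bytes; in the recipe these would come from something like
# dkuManagedFolderDownloadPath("Word", "/endyear.docx", as = "raw")
# (assumption: that call returns a raw vector).
demo_bytes <- charToRaw("placeholder content")
tmp_path <- write_raw_to_temp(demo_bytes, ext = ".txt")

# The path now points at a real file on disk.
print(file.exists(tmp_path))

# In the recipe, the file would then be read by path, for example:
# dat_word <- readtext::readtext(tmp_path)
```

This avoids relying on the managed folder being on the local filesystem, at the cost of one extra copy of the file.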

Thank you,

Sarina

Davidknox · February 2021

Hi @SarinaS,

Sadly that's not working for me. Sorry. Are you creating the corpus element in a notebook rather than the recipe? I ask only because your code includes the line 'NOTEBOOK-CELL: CODE'.

To answer your question, my flow has a managed folder called 'Word' which contains a Word document, a recipe (where this code is), and then a CSV export file which I called 'Word_docs'.

My hope was to be able to drop Word docs into the folder, run the recipe, and have an output file created. This file would be a frequency table containing all the text from the Word document. I already have a notebook with code for word clouds, sentiment, etc., which would reference this 'Word_docs' file.
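
The frequency table described above can be sketched without Dataiku or quanteda. This is a base R stand-in for the tokens() and dfm() steps, not the quanteda pipeline itself; the sample text and the word_frequencies name are assumptions for illustration.

```r
# Hypothetical helper: count word occurrences in a character vector.
word_frequencies <- function(text) {
  # Lowercase, then split on anything that is not a letter or apostrophe.
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]
  counts <- table(words)
  freq <- data.frame(
    word = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
  # Most frequent first; ties broken alphabetically.
  freq <- freq[order(-freq$count, freq$word), ]
  rownames(freq) <- NULL
  freq
}

# Sample text standing in for the contents of the Word document.
sample_text <- "The year ended well. The team met the target."
freq_table <- word_frequencies(sample_text)
print(freq_table)

# A plain CSV export would then be:
# write.csv(freq_table, "word_docs.csv", row.names = FALSE)
```

A long table like this (one row per word) is often easier to feed into word clouds than a wide document-feature matrix with one column per word.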

If I manage that, the next step would be splitting the corpus into bigrams.
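
For the bigram step, quanteda provides tokens_ngrams(), which works on the output of tokens(). Below is a minimal sketch, assuming quanteda is installed; an inline sample sentence stands in for the text read from the Word document.

```r
library(quanteda)

# Sample text standing in for the document contents.
toks <- tokens("the quick brown fox")

# n = 2 builds bigrams; adjacent tokens are joined with "_" by default.
bigrams <- tokens_ngrams(toks, n = 2)
bigram_vec <- as.character(bigrams)
print(bigram_vec)

# The bigram tokens can go through dfm() just like single-word tokens.
bigram_dfm <- dfm(bigrams)
print(bigram_dfm)
```

Because the bigrams are ordinary tokens, the rest of the recipe (dfm, conversion to a data frame, writing the dataset) should not need to change.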

Thanks for your help.

Sarina · February 2021

Hi @Davidknox,

Can you outline specifically where your code/output is not working? Please provide a screenshot of any error message and of the unexpected output. Is the issue that you are unable to read your file from the managed folder?

I'll outline the setup that I created in case that helps. I tested your code in the "Edit in notebook" version of a recipe, but I did create a recipe.

Here's my flow:

[Screenshot of the flow: Screen Shot 2021-02-16 at 1.16.44 PM.png]