Using R to convert word doc to CSV in Managed Folder
Hi,
Has anyone successfully converted a Word document to a CSV file using an R recipe and managed folders?
The purpose is to read the document, tokenise it and export the result to a CSV output for use elsewhere.
A version of this code worked for tokenising a CSV document in a managed folder, but this version reports that no file exists.
library(quanteda)
library(tidyr)
library(readtext)
# Recipe inputs
word <- dkuManagedFolderPath("rFwE9zCu")
filepath <- dkuManagedFolderPath("Word") # confirms there is a folder path /mnt/dataiku/dss_data/managed_datasets/GETTINGPDFSTOWORK/rFwE9zCu
## load data
dataload <- dkuManagedFolderDownloadPath("Word","/endyear.docx", as="text") ## raw doesnt work
word_docs<-as.data.frame(dtmatrix) ## to transpose columns as rows
dkuWriteDataset(word_docs,"Word_docs")
Answers
-
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @Davidknox,
Your code largely works for me, so I think I might be missing where you are running into an issue. If the below does not help, can you attach a screenshot of the error you are receiving? It might also help to see what your input data looks like and what you expect your output data to look like.
The only adjustment that I made was that I changed the readtext function to point to the full word document file path, like this:
file_path<-gsub(" ", "", paste(folder_path, file_path))
dat_word<-readtext(file = file_path)
So the readtext line is now more like readtext(file='/long/path/to/my/worddoc'). Here's the full code I tried out, which gives me a tokenized output dataset:
library(dataiku)
library(tidyr)
library(quanteda)
library(readtext)
# Recipe inputs
folder_path <- dkuManagedFolderPath("word documents") # gives me my full path to the folder
file_path <- dkuManagedFolderPartitionPaths("word documents") # gives the specific file document
## load data
dataload <- dkuManagedFolderDownloadPath("word documents",file_path, as="raw") ## raw doesnt work
dkuManagedFolderPathDetails("word documents", file_path)
file_path<-gsub(" ", "", paste(folder_path, file_path))
dat_word<-readtext(file = file_path)
dat_word
# -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
## create corpus
corpus_use<-corpus(dat_word, docvars = data.frame(party = names(dat_word)))
## tokenise
token_use<-tokens(corpus_use)
## create dtm
dtmatrix<-dfm(token_use)
## convert into dataframe
word_docs<-as.data.frame(dtmatrix) ## to transpose columns as rows
# Recipe outputs
# Compute recipe outputs
# TODO: Write here your actual code that computes the outputs
tokenizedR <- word_docs # Compute a data frame for the output to write into tokenizedR
# Recipe outputs
dkuWriteDataset(tokenizedR,"tokenizedR")
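A side note on the path-joining line in the code above: paste() inserts a space between its arguments, and the gsub() call then strips every space in the result, including any that belong to the file name itself. That could be one reason for a "no file exists" error when the document name contains spaces, though this is only a guess. The sketch below uses made-up paths (folder_path and file_name are hypothetical stand-ins for what the Dataiku calls return) to compare that approach with paste0(), which concatenates without a separator.

```r
# Hypothetical values standing in for what the managed-folder calls return
folder_path <- "/data/managed_folders/PROJECT/abc123"
file_name   <- "/end year report.docx"   # note the spaces in the file name

# Approach from the post: paste() adds a space, gsub() then removes ALL spaces
stripped <- gsub(" ", "", paste(folder_path, file_name))

# Alternative: paste0() joins with no separator, so nothing has to be stripped
joined <- paste0(folder_path, file_name)

print(stripped)  # spaces inside the file name are gone as well
print(joined)    # file name is preserved as-is

# In a recipe, file.exists(joined) is a quick check before calling readtext()
```

If the file name has no spaces, both approaches give the same path, so this only matters for documents such as "end year report.docx".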
I also note that you made the comment that the as="raw" argument did not work for you. It seemed to work for me, though I didn't end up using the results. If this is where you are running into an issue, can you give a little more detail and attach a screenshot of the error as well?
Thank you,
Sarina
-
Hi @SarinaS,
Sadly that's not working for me. Sorry. Are you creating the corpus element in a notebook rather than in the recipe? I ask only because your code makes reference to 'NOTEBOOK-CELL: CODE'.
To answer your question: looking at the flow, I have a managed folder called 'Word' which contains a Word document, a recipe (where this code is) and then a (CSV) export file which I called 'Word_docs'.
My hope was to be able to drop word docs into the folder, run the recipe and an output file is created. This file will be a frequency table containing all the text from the word document. I have a notebook with code for word clouds, sentiment etc already which would reference this 'Word_docs' file.
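To make the target shape concrete, here is a minimal sketch of a term/count frequency table. It is a base-R stand-in, not the quanteda pipeline from the recipe, and the sample sentence is invented rather than taken from the Word document. In the recipe, the same two-column layout would presumably be derived from the dfm instead.

```r
# Hypothetical text standing in for the contents of the Word document
text <- "The cat sat on the mat. The mat was flat."

# Lower-case, split on anything that is not a letter, drop empty strings
words <- unlist(strsplit(tolower(text), "[^a-z]+"))
words <- words[nzchar(words)]

# Count each term and sort by descending frequency
counts <- sort(table(words), decreasing = TRUE)

# One row per term, ready to be written out as a CSV-style dataset
freq_table <- data.frame(term  = names(counts),
                         count = as.integer(counts),
                         stringsAsFactors = FALSE)

print(freq_table)
```

A long table like this (one row per term) tends to be easier to feed into word clouds than the wide one-column-per-term layout of a document-feature matrix.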
If I managed that then the next step would be splitting the corpus into bigrams.
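For the bigram step, quanteda provides tokens_ngrams(), so in the recipe something like tokens_ngrams(token_use, n = 2) applied before dfm() would be the likely route. The sketch below only illustrates what a bigram is, using base R and an invented word vector so that it runs without any packages.

```r
# Hypothetical token sequence standing in for one tokenised document
words <- c("the", "cat", "sat", "on", "the", "mat")

# Pair each word with its successor; "_" mirrors quanteda's default joiner
bigrams <- paste(head(words, -1), tail(words, -1), sep = "_")

print(bigrams)
```

A sequence of n tokens yields n - 1 bigrams, so the six words here produce five pairs.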
Thanks for your help.
-
Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
Hi @Davidknox,
Can you outline specifically where your code/output is not working? Please provide a screenshot of any error message and of the unexpected output. Is the issue that you are unable to read your file from the managed folder?
I'll outline the setup that I created in case that helps. I tested your code in the "Edit in notebook" version of a recipe, but I did create a recipe.
Here's my flow:
This is the contents of my "word documents" folder, which contains one dummy-text word file:
Here's my R recipe:
And here is the output dataset:
Thanks,
Sarina