Using R to convert word doc to CSV in Managed Folder

Davidknox
Davidknox Registered Posts: 2 ✭✭✭✭

Hi,

Has anyone successfully converted a Word document to a CSV file using an R recipe and managed folders?

The aim is to read the document, tokenise it, and export the result as CSV for use elsewhere.

A version of this code worked for tokenising a CSV file in a managed folder, but for the Word document it reports that no file exists.

## packages I'm using
library(dataiku)
library(quanteda)
library(tidyr)
library(readtext)

# Recipe inputs
word <- dkuManagedFolderPath("rFwE9zCu")
file <- dkuManagedFolderPartitionPaths("Word") # confirms there is a file called /endyear.docx
filepath <- dkuManagedFolderPath("Word") # confirms there is a folder path /mnt/dataiku/dss_data/managed_datasets/GETTINGPDFSTOWORK/rFwE9zCu

## load data
dataload <- dkuManagedFolderDownloadPath("Word","/endyear.docx", as="text") ## as="raw" doesn't work
dat_word<-readtext(dataload)
## create corpus
corpus_use<-corpus(dat_word, docvars = data.frame(party = names(dat_word)))
## tokenise
token_use<-tokens(corpus_use)
## create dtm
dtmatrix<-dfm(token_use)
## convert into dataframe
word_docs<-as.data.frame(dtmatrix) ## to transpose columns as rows
# Recipe outputs
dkuWriteDataset(word_docs,"Word_docs")

Any ideas what is wrong with the code?

Answers

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker
    edited July 17

    Hi @Davidknox,

    Your code largely works for me, so I think I might be missing where you are running into an issue. If the below does not help, can you attach a screenshot of the error you are receiving? It might also help to see what your input data looks like and what you expect your output data to look like.

    The only adjustment that I made was that I changed the readtext function to point to the full word document file path, like this:

    file_path <- paste0(folder_path, file_path)
    dat_word<-readtext(file = file_path)

    So the readtext line is now more like readtext(file='/long/path/to/my/worddoc'). Here's the full code I tried, which gives me a tokenized output dataset:

    library(dataiku)
    library(tidyr)
    library(quanteda)
    library(readtext)
    
    # Recipe inputs
    folder_path <- dkuManagedFolderPath("word documents") # gives me my full path to the folder
    file_path <- dkuManagedFolderPartitionPaths("word documents")   # gives the specific file document
    
    ## load data
    dataload <- dkuManagedFolderDownloadPath("word documents",file_path, as="raw") ## as="raw" worked for me; the result is not used below
    
    dkuManagedFolderPathDetails("word documents", file_path)
    file_path <- paste0(folder_path, file_path)
    dat_word<-readtext(file = file_path)
    dat_word
    
    # -------------------------------------------------------------------------------- NOTEBOOK-CELL: CODE
    ## create corpus
    corpus_use<-corpus(dat_word, docvars = data.frame(party = names(dat_word)))
    ## tokenise
    token_use<-tokens(corpus_use)
    ## create dtm
    dtmatrix<-dfm(token_use)
    ## convert into dataframe
    word_docs<-as.data.frame(dtmatrix)  ## to transpose columns as rows
    
    # Compute recipe outputs
    tokenizedR <- word_docs # data frame to write into the tokenizedR dataset
    
    # Recipe outputs
    dkuWriteDataset(tokenizedR,"tokenizedR")
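
    The path-building step in the code above can be sketched on its own. This is a minimal illustration, not part of the dataiku package: build_local_path is a hypothetical helper, and the demo folder is a temporary stand-in for a managed folder so the snippet runs outside DSS. It assumes the managed folder lives on the local filesystem, which is the case where dkuManagedFolderPath() returns a usable path.

    ```r
    # Hypothetical helper (not part of the dataiku package): join a managed
    # folder's local path with a file path relative to that folder.
    build_local_path <- function(folder_path, relative_path) {
      # Strip leading slashes so the joined path has a single separator.
      relative_path <- sub("^/+", "", relative_path)
      full_path <- file.path(folder_path, relative_path)
      # Fail early with a clear message instead of a confusing readtext() error.
      if (!all(file.exists(full_path))) {
        stop("No file found at: ", paste(full_path, collapse = ", "))
      }
      full_path
    }

    # Stand-in for a managed folder so the sketch runs outside DSS.
    demo_folder <- file.path(tempdir(), "demo_folder")
    dir.create(demo_folder, showWarnings = FALSE)
    writeLines("placeholder", file.path(demo_folder, "end year.docx"))

    # Spaces in the file name are preserved.
    demo_path <- build_local_path(demo_folder, "/end year.docx")
    print(demo_path)
    ```

    Inside a recipe, the two arguments would instead come from dkuManagedFolderPath() and dkuManagedFolderPartitionPaths(), and the returned path would go to readtext(file = ...).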

    I also note that you made the comment that the as="raw" argument did not work for you. It seemed to work for me, though I didn't end up using the results. If this is where you are running into an issue, can you give a little more detail and attach a screenshot of the error as well?
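
    If you do want to use the as="raw" result, one possible approach is to write the bytes to a temporary file and point readtext at that file. The sketch below is only an illustration: raw_to_temp_file is a hypothetical helper, the demo bytes are a stand-in, and it assumes that dkuManagedFolderDownloadPath(..., as = "raw") returns a raw vector.

    ```r
    # Hypothetical helper: write raw bytes to a temporary file and return its path.
    raw_to_temp_file <- function(raw_bytes, ext = ".docx") {
      stopifnot(is.raw(raw_bytes))
      tmp_path <- tempfile(fileext = ext)
      writeBin(raw_bytes, tmp_path)
      tmp_path
    }

    # Stand-in bytes so the sketch runs outside DSS. In a recipe they would be
    # the value returned by dkuManagedFolderDownloadPath(..., as = "raw").
    demo_bytes <- charToRaw("not a real Word file")
    tmp_path <- raw_to_temp_file(demo_bytes)
    print(file.exists(tmp_path))
    ```

    A real .docx written this way could then be passed to readtext(file = tmp_path), which avoids depending on the folder's local path.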

    Thank you,

    Sarina

  • Davidknox
    Davidknox Registered Posts: 2 ✭✭✭✭

    Hi @SarinaS,

    Sadly that's not working for me. Sorry. Are you creating the corpus element in a notebook rather than in the recipe? I ask only because your code makes reference to 'NOTEBOOK-CELL: CODE'.

    To answer your question, looking at the flow I have a managed folder called 'Word' which contains a Word document, a recipe (where this code is) and then a (CSV) export file which I called 'Word_docs'.

    My hope was to be able to drop Word docs into the folder, run the recipe, and have an output file created. This file would be a frequency table containing all the text from the Word document. I already have a notebook with code for word clouds, sentiment etc. which would reference this 'Word_docs' file.

    If I manage that, the next step would be splitting the corpus into bigrams.
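
    For reference, the bigram step might look roughly like the sketch below. It assumes quanteda is installed, uses a made-up sentence in place of the Word document's text, and counts the bigrams with base R rather than through a dfm.

    ```r
    library(quanteda)

    # Illustrative text standing in for the contents of a Word document.
    demo_text <- "the cat sat on the mat and the cat slept"

    # Tokenise, then join adjacent tokens into bigrams ("the_cat", "cat_sat", ...).
    demo_tokens <- tokens(demo_text)
    demo_bigrams <- tokens_ngrams(demo_tokens, n = 2)

    # Count each bigram with base R and build a data frame, most frequent first.
    bigram_counts <- table(unlist(as.list(demo_bigrams)))
    bigram_freq <- as.data.frame(sort(bigram_counts, decreasing = TRUE),
                                 stringsAsFactors = FALSE)
    names(bigram_freq) <- c("bigram", "count")
    print(bigram_freq)
    ```

    A data frame like bigram_freq is the kind of frequency table that could be passed to dkuWriteDataset().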

    Thanks for your help.

  • Sarina
    Sarina Dataiker, Dataiku DSS Core Designer, Dataiku DSS Adv Designer, Registered Posts: 317 Dataiker

    Hi @Davidknox,

    Can you outline specifically where your code/output is not working? Please provide a screenshot of any error message and of the unexpected output. Is the issue that you are unable to read your file from the managed folder?

    I'll outline the setup that I created in case that helps. I tested your code in the "Edit in notebook" view of a recipe, so it does run as a recipe in the flow.

    Here's my flow:

    Screen Shot 2021-02-16 at 1.16.44 PM.png

    This is the contents of my "word documents" folder, which contains one dummy-text word file:

    Screen Shot 2021-02-16 at 1.16.52 PM.png

    Here's my R recipe:

    Screen Shot 2021-02-16 at 1.17.05 PM.png

    And here is the output dataset:

    Screen Shot 2021-02-16 at 1.17.48 PM.png

    Thanks,

    Sarina 
