preprocessReut21578XML {tm} | R Documentation |
Preprocess the Reuters-21578 XML archive by correcting invalid UTF-8 encoding and copying each text document into a separate file.
preprocessReut21578XML(reuters.dir, reuters.oapf.dir, fix.enc = TRUE)
reuters.dir | a character string giving the input directory. |
reuters.oapf.dir | a character string giving the output directory. |
fix.enc | a logical value indicating whether the invalid UTF-8 encoding in the Reuters-21578 XML dataset should be corrected. |
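A call might look like the following sketch. The directory names are placeholders, and the guard is an assumption added here because the function may not be exported by every installed version of tm.

```r
# Placeholder paths (assumptions, not part of the package).
in_dir  <- "Reuters-21578-XML"
out_dir <- "reuters-oapf"

# Only call the function when tm is installed and exports it.
have_fn <- requireNamespace("tm", quietly = TRUE) &&
  "preprocessReut21578XML" %in% getNamespaceExports("tm")

# Skip the call when the placeholder input directory does not exist.
if (have_fn && dir.exists(in_dir)) {
  tm::preprocessReut21578XML(in_dir, out_dir, fix.enc = TRUE)
}
```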
No explicit return value. As a side effect, the directory reuters.oapf.dir contains the corrected dataset.
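The two steps described above (repairing the encoding, then writing one file per document) can be illustrated in base R. This is only a rough sketch and not the implementation used by tm: the helper names fix_utf8 and split_documents, the output file naming, and the assumption that each document is wrapped in a REUTERS element are all hypothetical.

```r
# Hypothetical helper (not part of tm): repair invalid UTF-8 in a character vector.
fix_utf8 <- function(lines) {
  # sub = "byte" replaces each non-convertible byte with its hex code, e.g. "<e9>".
  iconv(lines, from = "UTF-8", to = "UTF-8", sub = "byte")
}

# Hypothetical helper (not part of tm): write each document element to its own file.
# Assumes documents are delimited by <REUTERS ...> and </REUTERS>.
split_documents <- function(lines, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  text <- paste(lines, collapse = "\n")
  # (?s) lets "." match newlines; ".*?" keeps each match to a single document.
  m <- gregexpr("(?s)<REUTERS.*?</REUTERS>", text, perl = TRUE)
  docs <- regmatches(text, m)[[1]]
  for (i in seq_along(docs)) {
    writeLines(docs[i], file.path(out_dir, sprintf("reut-%05d.xml", i)))
  }
  # Return the number of files written, invisibly.
  invisible(length(docs))
}
```

Running fix_utf8 before split_documents matters in this sketch, because regular-expression functions in R can fail on strings that are not valid in their declared encoding.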
Ingo Feinerer
Lewis, David (1997) Reuters-21578 Text Categorization Collection Distribution 1.0. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
Luz, Saturnino XML-encoded version of Reuters-21578. http://modnlp.berlios.de/reuters21578.html