PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you to let us know which areas need the most attention. Please see our website (http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.


DTOutput (class)


This class handles all matters pertaining to the production of output from an analysis. Your master DTAnalysis object will have one DTOutput object that is responsible for doing that job when the time comes.

Like the other main classes in this toolbox, a DTOutput object is typically created by DTJob from a specification contained in a JSON-format job file, although you can also create one manually in code if you know what you're doing.



Attributes

output_location (string): A path for the directory where the output should go. Can be an absolute path, or a relative one; if the latter, it will be interpreted as relative to the current Python working directory. If the directory doesn't exist already, the toolbox will attempt to create it. If this attribute is not specified, it defaults to delineate_output.

output_filename_stem (string): A base filename for any output to be created; none of the output will get named EXACTLY this, but this base + various suffixes will be used for the various output types. For example, if you provide my_2layer_cnn_analysis for this attribute, and you specify test_acc and labels for your output types, you'll get files named my_2layer_cnn_analysis_accs.tsv and my_2layer_cnn_analysis_labels.tsv.

If this attribute is not specified, some kind of default will be created, but you probably won't like it -- if you're using a JSON job file, it's going to be an MD5 hash of the contents of the job file, so it's going to look like gibberish. (The upside is that it is virtually guaranteed to be unique for each job file.) If there's no job file and you're just writing your own Python code, the default is default_output_filename_stem. You can find some more detail in the DTJob documentation under the job_file_hash attribute, if you just can't get enough of this stuff.

There is one special value you can set this to, which is the string json. If you provide json as the output filename stem, your outputs will not get named json_accs.tsv and so forth. Instead, they will use the name of the JSON job file you're using as their stem. Obviously, this option only makes sense if you are using the toolbox with JSON job files, not if you are writing your own Python code. It also probably only makes sense if you have a single job per JSON file, and that JSON file has a descriptive name that you want propagated to your outputs. But if you are working that way, it can be convenient, because it makes one less thing that you might forget to update if you are doing a lot of tweaking and iterating of job files. Note that this option is technically processed in DTJob before it ever gets to DTOutput, but since the attribute is described in this part of the documentation, we're just going to sweep that part under the rug.

output_file_types (string or list of strings): One or more type codes for the types of output files you'd like to be produced in this analysis. Type codes can be any of the following (a few of them are pulled together into a short example after this list):

  • all: A catch-all code if you just want everything. It will be expanded to ["test_acc", "scores", "labels", "training_acc", "timestamps", "job_config", "metadata"], and if you're doing a Keras (not PyMVPA) analysis, "validation_acc" and "trained_model" will be added to that list as well. Technically this is not all the possible output types, but it is the set that you are most likely to care about if you are the sort of devil-may-care flibbertigibbet who cavalierly specifies "all" to a very important output parameter. Hey, what do you expect? An "everything" bagel doesn't come with anchovies, gumballs, corgis, and uranium on it, does it? Context is key. (Note that you can't specify all and then follow it with some extra stuff after it... if you use all, it should be the only thing in output_file_types.)

  • test_acc: A file of accuracies from testing. Also includes a column for loss function values, which are only meaningful for Keras analyses; for PyMVPA, all the loss values will default to -1. (There is the possibility that we will separate out accuracies and losses in the future, but for now, they are combined in one file.) Outputs will be suffixed with _accs. Note: An older synonym for test_acc was acc_summary, and at the moment, using acc_summary will still work, but it is deprecated and will be removed at some point in the future. So, just use test_acc.

  • scores: A file of raw classification scores (from testing). How exactly these are scaled and how the file is formatted will depend on what classifier you're using, but basically they're the raw values that went into each classification decision, if you want a finer-grained reading on how the data were classified. Outputs will be suffixed with _scores.

  • labels: A file of class labels for the test dataset. Again, the exact formatting will depend on what classifier you're using (mainly, whether it's Keras or PyMVPA), but basically, it'll tell you the ACTUAL category of each of your test trials/samples/examples/whatever. If you want to double-check our work, you should be able to combine the info from this file with that of the scores file to recreate the accuracies in the test_acc file (i.e., by comparing the highest-scoring class for each trial against the true labels). Outputs will be suffixed with _labels.

  • training_acc: A file of accuracies from training. Also includes a column for loss function values, which are only meaningful for Keras analyses; for PyMVPA, all the loss values will default to -1. (There is the possibility that we will separate out accuracies and losses in the future, but for now, they are combined in one file.) Outputs will be suffixed with _training_acc.

  • validation_acc: A file of accuracies from validation. Also includes a column for loss function values. (There is the possibility that we will separate out accuracies and losses in the future, but for now, they are combined in one file.) This output type is only meaningful for Keras analyses, not PyMVPA. Outputs will be suffixed with _validation_acc.

  • timestamps: A file of timestamps from when each iteration (and fold, if relevant for your cross-validation scheme) finished. Technically, it is a timestamp of when the file itself was written to, not the very instant the analysis completed, but those numbers should be pretty close to one another in any reasonable situation. If you want a timestamp from when the overall analysis started (to allow you to calculate how long the first iteration/fold took to run, since the timestamps file only includes ending timestamps), you will find one of those in the metadata output type. Outputs will be suffixed with _timestamps.

  • metadata: A file of various metadata about the analysis. This is one of the newest output types and perhaps subject to change, but at present it includes the time the analysis was started, username of the user running the analysis, hostname of the computer the analysis is running on, location of the DeLINEATE toolbox (i.e., what directory the currently running toolbox is in, in case you have multiple copies/versions installed), Python version, and DeLINEATE toolbox version. In addition, PyMVPA analyses will include the PyMVPA version, and Keras analyses will include the Keras version, CUDA version, and name/version of Keras's machine learning backend (Theano/TensorFlow/etc.). Some metadata values may not always be determinable, but we try to fill in what we can. Outputs will be suffixed with _metadata.

  • trained_model: A file containing the trained model (neural network) for each iteration of the analysis (and fold, if that's relevant for your cross-validation scheme). Currently only works for Keras analyses. This will be saved in Keras's native format, so you'll need to use Keras functions directly if you want to do anything with this output (for example, apply the model to an entirely separate dataset, outside the auspices of our toolbox, although someday we may incorporate functionality for this). Outputs will be suffixed with _trained_model_ITERATIONNUMBER_FOLDNUMBER.h5.

  • tags:TAGNAME: This one's a bit weird. It's the only output type for which the code is not always the same, and it's the only output type you might want to specify more than once. Output files will contain tags for each trial/sample/example/whatever in your test dataset, kind of similar to the labels output type, but for arbitrary sample attributes, not necessarily the class labels used in classification. This is useful if your dataset has additional "sample attributes" (SAs; see DTData documentation for more details) beyond just class/category labels, and you want to know what those are for your test dataset. For example, you might have each sample of data tagged with a trial ID and/or a participant ID, and you want to know which participants/trials were included in the test dataset for some kind of subsequent analysis. In this case, you'll specify one tags output code for each SA you want to output, with the name of the SA after the colon. For example, if your dataset includes SAs named trialID and subjectID, and you want to include both of those tags in the output, you'd specify tags:trialID and tags:subjectID in your list of output type codes. Note that tags are not included in the all output type code, so if you want them, you have to specify them manually. Outputs will be suffixed with _tags_TAGNAME.

  • job_config: A JSON file containing the job currently being run. This is mostly an option so that, even if you mislabel your outputs or something, or change your original JSON job file after the fact, you can always have a copy of what was actually run alongside your other outputs, so you can remind yourself what you did. Note that this is not a direct copy of the JSON file you actually ran; it is a re-spitting-out of the same information after it has been read in and converted into the DeLINEATE internal job structure format. So it should be a valid job file, and you could run it if you wanted to, but it probably won't be bit-for-bit identical to your input JSON file. (For example, the order of the various options, and things like spacing, will probably be different.) Relatedly, if your input JSON file contains a list with more than one job in it, the output created by the job_config output option will only contain a single job, namely the one currently running. Outputs will be suffixed with _job_config.json.

  • madame_kerasana: Who knows what mysteries of the universe are contained in our data? Madame Kerasana does. Outputs will be suffixed with _madame_kerasana.
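
Pulling a few of those type codes together, here are some hypothetical values you might give output_file_types (shown as plain Python; in a JSON job file the syntax is nearly identical, just with double-quoted strings):

output_file_types = "all"                                                       # everything (expanded for you as described above)
output_file_types = ["test_acc", "labels", "scores"]                            # a fairly typical explicit list
output_file_types = ["test_acc", "labels", "tags:trialID", "tags:subjectID"]    # with sample-attribute tags added manually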

pre_existing_file_behavior (string): One of several possible string codes for the intended behavior if it happens that an output file already exists when DTOutput goes to create it. These codes can be any of the following (default is silent_append):

  • prompt: If this happens, interactively prompt (on the command line) to see what the user wants the toolbox to do. The options given by the interactive prompt will basically be the same as the remaining bullets in this list, with one additional option -- to simply quit.

  • overwrite: Throw caution to the wind and just overwrite the old output! For obvious reasons, not really recommended unless you really know for sure that you want this behavior. This option is for people who back their cars out of the driveway without looking, because it'll probably be fine... or for people who do it because they want to get a little demolition derby action going.

  • silent_append: Our most pragmatic and popular option. Just appends the new run onto the old output files without making a fuss, other than to add a few blank lines to make it clear what happened. Often this is the behavior you want anyway (e.g., if you want to run additional iterations of a previously completed analysis and tack them on to what was run before), but even if the duplication of output filenames is a mistake, this option is fairly non-destructive as such things go... it just might entail some detective work to realize what happened after the fact, what with it being silent and all.

  • increment: Solve the problem of duplicated output filenames by adding a numeric suffix and making them not duplicated anymore. Should work fine in almost all cases. If you have very weird filenames with lots of numbers in them or something, there is some chance that our algorithm for working out how to increment the filename could guess wrong, but in that case you'd probably just end up with some weirdly-named files... there shouldn't be any real risk of overwriting anything or losing data.

delimiter (string): Here's an easy one -- this is just what you want delimiting the cells in most of the output file types. Defaults to a tab character, but can be any string (a comma would be the next most-popular option, probably). Accepts C/Python-style escape sequences (e.g. \t for tab).

file_extension (string): Another easy one -- what extension you want most output files to end with. This only applies to the majority of the output files that are spreadsheet-like, i.e., it does not apply to trained models (which get a .h5 extension) or JSON job files (which get a .json extension). If you want a standard filename extension, include the leading period in this string -- the toolbox does not try to guess whether it should add one. Default is .tsv for tab-separated values. Second place would probably be .csv, for comma-separated values. It may make sense to consider this option at the same time as the delimiter option directly above, what with CSV implying comma delimiters and TSV implying tab delimiters and so forth.
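
For instance, pairs of settings that make sense together might look like this (a hypothetical illustration; the dict names here are not special toolbox variables, just a convenient way of grouping the two attributes):

tab_style   = {"delimiter": "\t", "file_extension": ".tsv"}   # the defaults
comma_style = {"delimiter": ",",  "file_extension": ".csv"}   # classic CSV instead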

job_file_hash (string): This contains the MD5 hash mentioned above under output_filename_stem that is used as an alternate output filename stem if output_filename_stem is not provided. You should not have to interact with this attribute unless you are doing something very strange; if you want to affect the filename of the output, just put something in output_filename_stem.

job_struct (dictionary): A copy of the job structure currently being run, for purposes of the job_config output option. If you are using JSON job files, the toolbox will take care of this for you, so you don't have to worry about setting it. If you are writing your own Python code, you may not have any need for the job_config output option anyway. But if you are writing your own Python code and you are setting up job structures manually (rather than using the other toolbox classes directly and avoiding the concept of job structures altogether), and you want to output a job_config file, then you may need to set this attribute. If so, job_struct should be one valid dictionary representing a single job (not a list of dictionaries containing multiple jobs). On the off-chance you actually want to do this, you could check out the DTJob documentation for some more info on job structures.

_analysis (proxy to DTAnalysis object): A proxy using a weak reference to the parent DTAnalysis object of this DTOutput object. (If you are unfamiliar with these concepts and want to become familiar, see the Python documentation for the weakref module.) This attribute is present so that DTOutput can reach up into the analysis object to get all the information it needs to create output. This is an implementation detail and said implementation is slightly tricky, so even users writing their own Python code should not interact directly with this attribute (and JSON job file users don't need to worry about it at all). However, if you are writing your own Python code, and not wrapping up everything in a DTJob object (which normally takes care of these details for you), you need a way to tell the DTOutput object what analysis it's associated with before running the analysis and creating output; in that case, you should use the set_analysis() method described below instead of accessing the _analysis attribute directly.



Methods

Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.


__init__( self, output_location='delineate_output', output_filename_stem=None, output_file_types=None, pre_existing_file_behavior='silent_append', delimiter='\t', file_extension='.tsv', job_file_hash='default_output_filename_stem', job_struct=None, check_existing=True )

(no return value) Initializer function for creating a new DTOutput object. Pretty much just assigns all the object's attributes, plus a little basic checking of the output_file_types and output_filename_stem parameters. All of the arguments are optional at this point in time, but most of them will need to get assigned one way or another before you can actually use the object.

If the check_existing parameter is set to True, here at initialization is also where we will check for pre-existing output files that collide with the specified output filename(s), and do whatever the pre_existing_file_behavior says if any duplicates are found. Normally you should leave check_existing set to its default of True and only use pre_existing_file_behavior to control what happens in such cases. The only reason check_existing exists is an implementation detail having to do with how we validate arguments in job files (in the laziest possible way, by just creating a temporary DTOutput object and seeing if it encounters any errors, in which case we don't need/want to do any interactive duplicate-file checking)... but if check_existing is False, then pre_existing_file_behavior gets ignored and any duplicate files will just get appended to. So, in short, you didn't really need to know all that, because end users should not generally touch check_existing.
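
For those writing their own Python code, here is a minimal sketch of creating a DTOutput object directly, using only parameters documented above (the import line is an assumption about how the toolbox is laid out on your system, so adjust it to match your install):

from DTOutput import DTOutput   # hypothetical import path; yours may differ

my_output = DTOutput(
    output_location='my_results',                        # created if it doesn't exist yet
    output_filename_stem='my_2layer_cnn_analysis',       # base name for all output files
    output_file_types=['test_acc', 'labels', 'scores'],
    pre_existing_file_behavior='increment'               # don't clobber or append to old runs
)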


write_output( self, iteration, fold=0, do_headers=False, do_config=False )

(no return value) The main method that most end users might ever conceivably interact with. If everything else is fully set up in your analysis, this function pretty much takes care of all the output-writing duties, and you don't really need to worry about any of its numerous sub-functions. Even that is a bit of a stretch, since normally DTAnalysis will call this itself when output-writing is needed, but if for some reason you ever need to write output manually, this method will probably take care of your needs. If that's the case, normally you would call this at the end of each iteration (or fold, if relevant for your cross-validation scheme) of the analysis, and it will write whatever output files you have the DTOutput object configured for.

The iteration parameter is mandatory and should be an integer. fold is optional but should also be an integer if it is specified. do_headers should be True if you want to write column headers for all spreadsheet-like output files (usually just on the very first iteration/fold and not thereafter). do_config should be True if you actually want to write out the JSON job config file this time (assuming that is one of the output file types you have requested to be written); normally you would only write this file out once per analysis, so much like do_headers, do_config should generally only be True on the first iteration/fold of an analysis.
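
If you do end up calling it yourself, the usual pattern might look something like the following hypothetical sketch (my_analysis and my_output stand in for your own already-configured DTAnalysis and DTOutput objects, hooked together as described under set_analysis() below, and the actual training/testing step is elided):

n_iterations, n_folds = 10, 5
for iteration in range(n_iterations):
    for fold in range(n_folds):
        # ... run training/testing for this iteration/fold via your analysis code ...
        first_pass = (iteration == 0 and fold == 0)
        my_output.write_output(iteration, fold=fold,
                               do_headers=first_pass,   # column headers only on the very first pass
                               do_config=first_pass)    # ditto for the job_config file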

Many of the subsequent methods follow the basic pattern established by write_output(), so we're going to be pretty brief describing those. For more details on what each of the output types actually is, see the output_file_types entry in the Attributes section above.


write_a_file( self, this_output_file_type, iteration, fold=0, do_headers=False, do_config=False )

(no return value) Whereas write_output() above writes out all the outputs your DTOutput object is configured for, this sub-function just writes out one file, as specified by the this_output_file_type parameter (which should be one of the strings described under the output_file_types attribute above). Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_acc_summary( self, iteration, fold=0, do_headers=False)

(no return value) Writes a line of (testing) accuracy and loss function values. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_scores( self, iteration, fold=0, do_headers=False)

(no return value) Writes a line of raw classification scores (from testing). Optionally writes a line of headers as well if do_headers is True. Calls one of the two methods below to do its dirty work. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_scores_generic( self, iteration, fold, do_headers)

(no return value) Writes a line of raw classification scores (from testing), for any case other than PyMVPA SVM analyses with 3+ classes; thanks to the fact that different PyMVPA classifiers report their output differently, we have to do the same and implement special output-writing functions for certain cases. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_scores_pymvpa_multiclass( self, iteration, fold, do_headers)

(no return value) Writes a line of raw classification scores (from testing), for the case of PyMVPA SVM analyses with 3+ classes; thanks to the fact that different PyMVPA classifiers report their output differently, we have to do the same and implement special output-writing functions for certain cases. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_training_acc( self, iteration, fold, do_headers)

(no return value) Writes a line of accuracy and loss function values from training. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_validation_acc( self, iteration, fold, do_headers)

(no return value) Writes a line of accuracy and loss function values from validation. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_labels( self, iteration, fold, do_headers)

(no return value) Writes a line of class labels for the testing trials/samples/examples/whatever. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_metadata( self, iteration, fold, do_headers)

(no return value) Writes a line of metadata about the analysis, hardware, software, etc. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_trained_model( self, iteration, fold )

(no return value) Saves out the trained model (neural network) from the current iteration/fold of the analysis. Currently only works for Keras analyses. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_tags( self, iteration, this_output_file_type, fold=0, do_headers=False)

(no return value) Writes a line of the specified sample attribute "tags" for the testing trials/samples/examples/whatever. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_output_kerasana( self, iteration, fold )

(no return value) Ah, Madame Kerasana. A mysterious and exotic method we first met in a distant land after a two-week spirit quest involving massive amounts of peyote, body paint, Cool Ranch Doritos, clothespins, and Earl Grey tea. Legend has it that for the pure of heart, her predictions always come true. But every time we use her services, we feel a faint throbbing in our left big toe, which we assume is most likely from an old croquet injury and almost definitely not the result of an ancient curse we are gradually causing to awaken.

Users would not normally interact with this method directly, and if they did, their eyeballs would probably turn to grape jelly and ooze right out of their sockets. See the notes on write_output() above for more information... if you dare.


write_output_timestamps( self, iteration, fold=0, do_headers=False)

(no return value) Writes a line with a timestamp in ISO 8601 format, e.g., 1981-01-09.08:56:12.345678. Optionally writes a line of headers as well if do_headers is True. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


write_job_config( self )

(no return value) Saves out a JSON format job file corresponding to the current analysis. Users would not normally interact with this method directly. See the notes on write_output() above for more information.


set_analysis( self, analysis )

(no return value) Creates a proxy to the DTAnalysis object analysis that is passed in using a weak reference, and saves that in the DTOutput object's _analysis attribute. This is because creating various outputs requires reaching up into the analysis object and rummaging around to get all the details about the current round of classification or whatever. Since our DTOutput object is typically owned by the DTAnalysis object, a weak reference is the "right" way to do this without creating a circular object graph. (If you are unfamiliar with these concepts and want to become familiar, see the Python documentation for the weakref module or just search the web for information on weak references more generally.)

Most of the time, even users writing their own Python code won't need to access this method, as DTJob objects usually call it and take care of everything when setting up an analysis prior to running it. But if you're writing your own code and not wrapping everything up in a DTJob (or, maybe, if you're doing something pretty weird with DTJob that causes this method not to be called), you might need to use it. Usage is pretty simple -- once you've got your DTAnalysis and DTOutput objects all configured otherwise (let's call them myAnalysis and myOutput), and you have already set myOutput to be the output_handler attribute of myAnalysis, just call this method as follows:

myOutput.set_analysis( myAnalysis )

Check out the _analysis entry under the Attributes section above for a bit more info.
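
Putting that together with the output_handler step mentioned above, the whole hand-rolled wiring amounts to something like this (a hypothetical sketch; myAnalysis and myOutput are assumed to be otherwise fully configured):

myAnalysis.output_handler = myOutput   # the analysis owns its output object...
myOutput.set_analysis( myAnalysis )    # ...and the output object gets a weak-reference proxy back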


canonicalize_all_output_setting( self, just_checking_filenames=False )

(no return value) If the output_file_types attribute is set to all, this method figures out what all should actually mean in the current context and turns output_file_types into a list of the relevant output type codes. The just_checking_filenames argument is only there to handle a weird corner case where the user has specified all output types but has not defined a DTModel yet in their analysis, but they still want to check and see if any of the potential output filenames exist... you know what, it's weird. Don't worry about it too much. Users would not normally interact with this method directly anyway.


build_a_filename_workshop( self, this_output_file_type )

(returns a filename string) For a given output file type specified in this_output_file_type, coupled with the output_location, output_filename_stem, and file_extension attributes that are assumed to be already defined for the current DTOutput object, this method returns what the output filename should be. (Or, in the case of trained Keras models, the general format of the output filename, with a couple of values to be filled in later.) Users would not normally interact with this method directly.


check_existing( self )

(no return value) Checks to see if any of the output files that are set to be generated based on the current state of the DTOutput object already exist; if they do, this method will take whatever action is specified in the pre_existing_file_behavior attribute (see Attributes section above). Typically this is called once per DTOutput object, when it is first initialized and before it has generated any output... since once it has generated output, those files will have to exist! Anyway, that is all taken care of in the __init__() method; users would not normally interact with this method directly.



CLASS METHODS

write_headers( cls, filename, header_spec_list, delimiter="\t" )

(no return value) Writes out a series of headers to a text file. This is always done via appending; if a file is supposed to be overwritten, that is taken care of before we get to this method. filename is obviously the path to the file to be written; delimiter is what to put in between cells (defaults to a tab character).

The header_spec_list argument is slightly more complex. It should be a list, and the items in the list should be either strings or dictionaries (a mix of strings and dictionaries is OK). Strings will just be written literally into the file one by one, with delimiter between them. However, dictionary items in the header_spec_list are used to write out multiple columns of headers based on a succinct specification. Those dictionary items should have two keys, base and length. The value in base should be a string that is the base name of the headers in question; the value in length determines how many numbered columns get written out. For example, if we are doing headers of class labels for 5 trials, the dictionary {'base':'Label','length':5} would produce header columns Label0000000, Label0000001, Label0000002, Label0000003, Label0000004.

This method is used by most of the various write_X() methods for various output types to write their headers; users would not normally have to interact with it directly, but it is available for use if, for example, you are implementing your own custom output function and want to base it on the built-in ones.
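
As a hypothetical illustration (the filename and header names are made up, and DTOutput is assumed to be imported as in the earlier sketch):

DTOutput.write_headers(
    'my_analysis_labels.tsv',
    ['Iteration', 'Fold', {'base': 'Label', 'length': 5}],   # -> Iteration, Fold, Label0000000 ... Label0000004
    delimiter='\t'
)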


write_one_row_values( cls, filename, format_specs, values, delimiter="\t" )

(no return value) Writes out one row of data values to a text file. This is always done via appending; if a file is supposed to be overwritten, that is taken care of before we get to this method. filename is obviously the path to the file to be written; delimiter is what to put in between cells (defaults to a tab character).

The other arguments are slightly more complex. Both format_specs and values should be lists, and they should be the same length. Each item in format_specs should be a Python format specification string, like ["{0:05d}", "{0:.9f}", "{0}"] or whatever.

Each item in values can either be a single value (string, number, whatever, as long as it matches the corresponding format spec in the format_spec list), or a list of such values; in the latter case, each item in the list will be printed as a separate cell using the same format spec.

For a silly example, if format_specs is

["{0}", "{0:.4f}", "{0:05d}"]

and values is

['row_name', math.pi, [0, 1, 2] ]

then the row of values written into the file would be:

row_name 3.1416 00000 00001 00002

This method is used by most of the various write_X() methods for various output types to write their data rows; users would not normally have to interact with it directly, but it is available for use if, for example, you are implementing your own custom output function and want to base it on the built-in ones.
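
In code, that silly example would look something like this (hypothetical filename; math is imported just to supply the pi value, and DTOutput is assumed to be imported as in the earlier sketch):

import math

DTOutput.write_one_row_values(
    'my_analysis_silly_example.tsv',
    ["{0}", "{0:.4f}", "{0:05d}"],        # one format spec per item in values
    ['row_name', math.pi, [0, 1, 2]],     # the list [0, 1, 2] expands into three cells
    delimiter='\t'
)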


validate_arguments( cls, args )

(returns True or False) Validates the various input arguments used to initialize a DTOutput object; returns True if they are all OK and False if something is wrong with them (e.g. missing required attributes, wrong values or data types). Typically used to check the format of a JSON job file, and as such would be called by DTJob when the job file is read in (rather than a user calling this method directly).

Note that currently this method works in the laziest way possible, namely it just tries to create a temporary DTOutput object with the arguments given. If that object is created successfully, then it returns True; if some kind of error occurs, it returns False. In the future, hopefully we will make this method a bit smarter so it can actually inspect the arguments and give more useful feedback on exactly what is wrong with them.

Note also that if a Python global variable named dt_debug_mode is defined and set to True, a failed validation will cause an error rather than just making this method return False. Right now dt_debug_mode does default to True, but we intend to change this some day to the more graceful behavior of simply giving an informative warning when validation fails.



JSON job file options

Generally what you specify in a JSON job file will be some subset of the attributes listed in the Attributes section above; however, not all attributes will need to be included in a typical job file. So here is a quick recap of the attributes that would typically be a good idea to include in a job file, and what data type they should be. For details on how they behave, see the Attributes section. As always, we recommend that you check out the sample_jobfiles directory and/or our video tutorials for some usage hints.

output_location (string): A path for the directory where the output should go. You should pretty much always include this attribute.

output_filename_stem (string): A base filename for any output to be created; the actual output filenames will be this stem + various suffixes. You should pretty much always include this attribute as well -- and remember to update it when you tweak an analysis based on an existing job file!

output_file_types (string or list of strings): One or more type codes for the types of output files you'd like to be produced in this analysis. Once more, you should pretty much always include this attribute if you want any output to be produced at all. And as an ancient machine learning philosopher once asked, if a decision tree falls in a random forest but no one is around to observe the output, did it even make a prediction?

pre_existing_file_behavior (string): One of several possible string codes for the intended behavior if it happens that an output file already exists when DTOutput goes to create it. The default of silent_append is fairly reasonable and non-destructive, so you can consider this attribute optional, but worth considering for inclusion in your job file if you want one of the other behaviors.

delimiter (string): Just the character or string that you want delimiting the cells in most of the output file types. The default of tab is pretty reasonable, but if you are a filthy comma lover or something, you can include whatever you want in your job file. Heck, delimit your outputs with BABABOOEY for all we care.

file_extension (string): Just the extension you want most output files to end with. The default of .tsv matches the default tab delimiter, but if you change the delimiter to a comma or BABABOOEY or whatever, you probably want to make this .csv or .bbsv or whatnot to match.
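
Finally, to tie the above together, here is a hypothetical sketch of the output-related settings, written as a Python dictionary that mirrors what the JSON would contain (how this block nests inside a full job file, and what its key is called there, is covered in the DTJob documentation and the sample_jobfiles directory rather than assumed here):

output_settings = {
    "output_location": "results/my_2layer_cnn",
    "output_filename_stem": "my_2layer_cnn_analysis",
    "output_file_types": ["test_acc", "labels", "scores", "tags:subjectID"],
    "pre_existing_file_behavior": "increment",
    "delimiter": "\t",
    "file_extension": ".tsv"
}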