PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, get in touch if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you to let us know what areas need the most attention. Please see our website (
http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.
The primary class that maintains information about a job (or list of jobs) to run. The most common usage of this class would probably be indirectly, e.g., by passing a JSON-based job configuration file to
delineate.py, which basically just creates
DTJob objects out of each file it is given and runs them in sequence. So if you are using
delineate.py and specifying your jobs in JSON config files, you don't need to know too much about the inner workings of this class. Just see the documentation for the JSON-format job config files and
delineate.py, and you should be all set.
However, if you are importing these modules manually and writing your own Python code, you may find yourself creating
DTJob objects. See
dt_design_looper_example.py for a basic script that does this.
The main thing to know in either case is that
DTJob essentially maintains a structure of parameters (or hyperparameters, in deep learning parlance) that define a given analysis using a given dataset. Eventually, when the
run() method is called, it will use those (hyper-)parameters to create
DTModel objects. So you can think of
DTJob as kind of an umbrella container for those objects, although in the current implementation it doesn't hang on to them for any length of time -- those objects are just created long enough to run the analysis, and then destroyed. Another way to think of
DTJob objects is as the thawed, usable form of the information frozen in JSON-based job config files.
job_file (string): A filename or path to a JSON-format job configuration file that will be used to create the rest of the
DTJob object. This is the only attribute that can be specified when creating a new
DTJob object via the
__init__() method; if that's the way you are going to create your
DTJob object (which is essentially what
delineate.py does), then you won't have to fiddle with any of the other attributes. In this case, all you'd have to do is something like:
my_job = DTJob.DTJob( 'my_job_file.json' )
Although if that's all you want to do, you could probably just run
delineate.py and pass it
my_job_file.json in the first place.
If you are manually creating your
DTJob object, then this attribute is not mandatory to specify; in
__init__(), it defaults to
None. So you can also just do:
my_job = DTJob.DTJob()
... and then fill in the rest of the attributes later, before you run the job.
Another option would be to load a basic job configuration in via a JSON file, but then tweak the job structure manually before you actually run. This is fundamentally not too different from specifying the whole job structure yourself in code, but it might save you a few lines of code if you typically use a very similar configuration (which can be saved in the JSON file and loaded in to create a template
DTJob object) and just want to tweak a few things in your Python code.
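The load-then-tweak pattern described above is just ordinary JSON-plus-dictionary manipulation; here is a minimal sketch of the idea (the key names and values here are made up for illustration and are not taken from the real config format):

```python
import json

# Hypothetical template config; the real key names and structure are
# documented with the JSON job config file format.
template = '{"model": {"epochs": 10}}'

# Load the template into a dictionary, then tweak one value in code
# before handing the structure off to be run.
job = json.loads(template)
job["model"]["epochs"] = 50
```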
job_file is just going to be passed along to Python's regular
open() function eventually, so it can be a full path or a bare filename or whatever, just as long as it is accessible from wherever you're running.
job_structure (list of dictionaries): This is the actual data structure representing what is going to get run (and with what data), either specified in
job_file or in code that you write.
It follows the same basic format as the JSON-format job config files, so for a breakdown of all the stuff that can go into this data structure, see the documentation for those files.
job_structure is a list of Python dictionaries. (However, if you make it a single Python dictionary,
DTJob should generally be nice and turn it into a single-item list as necessary... but you should probably think of it and use it as a list of dictionaries to avoid problems down the line.) Each dictionary contains the information needed to run one job. The JSON job config files can contain either a single job or a list of jobs, so loading the contents of one job file into one
job_structure maintains a 1:1 correspondence between actual job files and DTJob objects.
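That dict-to-list courtesy could look something like the following sketch (illustrative only, not the toolbox's actual code):

```python
def normalize_job_structure(struct):
    # A lone dictionary gets wrapped into a single-item list, so that
    # downstream code can always treat job_structure as a list of dicts.
    if isinstance(struct, dict):
        return [struct]
    return list(struct)
```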
Each dictionary contains a few main top-level keys (model among them; see the documentation for the config files for the full list). The values corresponding to those keys will in turn be other dictionaries; each of these dictionaries contains the (hyper-)parameters necessary to instantiate the corresponding objects, such as a
DTAnalysis and a
DTModel object. Again, see the docs for the config files for more details on what should go into those.
As suggested above, see
dt_design_looper_example.py for some ideas as to how you might go about creating this structure manually in Python code.
Also, as noted above in the description of the
job_file attribute, if you wanted to initially create a
DTJob object using an actual JSON job config file, then tweak the
job_structure attribute manually before running, that would be perfectly allowable. The
job_file parameter is automatically converted into a
job_structure if you pass in a job file when creating a brand-new
DTJob object, so you can assume if you create a
DTJob object from a file that the
job_structure will exist when the object is done being initialized. Alternatively, if you set the
job_file attribute after creating the
DTJob object for some reason, you will need to call the
reload() method (see below) to turn that file into a
job_structure. In any case, whatever you want to do to modify the
job_structure before calling the
run() method is totally up to you.
last_loaded_job_file (string): Normally this attribute would not need to be set or accessed by the user. It is set whenever the
reload() method (see below) is run, so that we have a record of which job file was last loaded in.
"But, all-knowing documentation writers," you say. "Why do you even need this? Isn't it just the same as the
job_file attribute then?" We chuckle knowingly. "Yes, my child," we condescendingly reply. "But suppose some user, less forward thinking than you or we, changes the
job_file attribute on the fly and then calls the
run() method. Presumably, they would be expecting the new
job_file to get run, but instead the
job_structure would still reflect an earlier job file. So we save the name of the last job file that we know we loaded, so that
run() can make sure that no one has changed out the
job_file on us, and warn the user if they have."
"Yeah, yeah, that makes sense... but just one more thing," you say, suddenly sounding a lot more like Columbo. "Couldn't you do the same thing without needing another attribute? Either by using a setter function that updates
job_structure every time
job_file is changed, or by checking the actual contents of the
job_structure attribute against the current
job_file whenever run() is called?" We turn around, pause, and nod wearily. "Yes, we could have," we say. "But those sounded like more work than just doing this, and also we just thought of those other options now, while writing this documentation, so it's going to stay that way for a while. And, at least this way, the user gets a warning that changing
job_file on the fly without explicitly then reloading the
job_structure using the
reload() method is kind of a weird thing to do."
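In other words, the staleness check boils down to a simple string comparison; schematically (names here are illustrative, not the actual toolbox code):

```python
def job_file_is_stale(job_file, last_loaded_job_file):
    # If job_file has been changed since the last reload(), the
    # job_structure in memory may not match it, so warn the user.
    return job_file != last_loaded_job_file
```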
job_file_hash (string): Normally, the user will not have to set or access this attribute either. It is currently used in two places: By
DTModel, to determine the name of a tempfile that will be created during Keras analyses, and by
DTOutput, to provide a default output filename if the user has not specified their own. If you really care, you can see the documentation for those classes for a bit more detail. But essentially, if the user loads a
job_file (see above), this gets a hashed version of the contents of that file; if no job file is provided, it gets a generic default value.
The only occasion we can envision where you might want to think about this attribute is when two analyses are running concurrently in the same place in the filesystem. If they are using two different job files, then the
job_file_hash will be different for them, and there is no danger of collision in either the Keras tempfiles or the output (assuming the user was foolish enough not to provide an output filename). But if they are both using the same job file (which would be a weird thing to do), or if the user is writing their own code rather than using JSON job files and has left the
job_file attribute blank, there is the danger of files having the same name and thus getting overwritten. In the first case, the solution should probably be just to not be weird and use two different job files with different output filenames explicitly specified, which will ensure that the tempfiles get unique names also. In the second case, you may want to specify
job_file_hash manually and give it either a random value or some meaningful value that is guaranteed to be unique to the instance of the analysis that is currently running.
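If you do end up setting job_file_hash yourself, any string that is sufficiently unique to the running instance will do; for example, a random tag from Python's standard library (a sketch, not a toolbox requirement):

```python
import uuid

# A random 32-character hex string, effectively guaranteed to be unique
# to this run; you could assign a value like this to job_file_hash.
unique_tag = uuid.uuid4().hex
```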
Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.
(no return value) Initializer function for creating a new
DTJob object. Pretty much just assigns all the object's attributes; at this time, the only one you can explicitly specify at init time is job_file.
When the initializer runs, it mainly just calls the
reload() method (see below); if a
job_file was passed in,
reload() should then populate the other attributes listed in the Attributes section. If no job file was passed in, then most of the attributes just stay None, and
job_file_hash gets a generic default string as detailed in its entry above.
(no return value) Function to load the contents of a JSON job file (namely, the one specified in the
job_file attribute) into the
job_structure attribute. Automatically called during initialization. If users are writing their own code using this module, they should also call
reload() if they manually change the
job_file attribute and want the file to actually get loaded in.
Along the way it does some (fairly basic) validation of the specified file... first just making sure it exists and has valid JSON syntax. If that's all good, it reads in the data structure from the JSON file and then calls
validate_job_structure() (see below) for additional checking.
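Those first two checks amount to ordinary file-and-JSON handling; a rough sketch of the idea (not the toolbox's actual code):

```python
import json
import os

def load_job_file(path):
    # Step 1: the file has to exist at all.
    if not os.path.exists(path):
        raise FileNotFoundError(f"Job file not found: {path}")
    # Step 2: it has to parse as valid JSON (json.load raises otherwise).
    with open(path) as f:
        return json.load(f)
```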
After the job file is read in, this function also updates the
job_file_hash attribute with an MD5 hash of the file's contents; see the corresponding entry in the Attributes section above for more info on what that hash is used for.
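Computing that hash is essentially a one-liner with Python's hashlib; a sketch along the lines of what reload() stores (illustrative, not the actual implementation):

```python
import hashlib

def hash_file_contents(path):
    # MD5 hex digest of the file's raw bytes, similar in spirit to the
    # value that ends up in job_file_hash.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```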
(no return value) This is a pretty big function at the heart of the entire toolbox; it runs the analysis. If you have used the
delineate.py script, you may have noticed that it is basically just a loop of creating
DTJob objects and then calling
run() on them. If you are writing your own code using this module, you probably will call
run() at some point.
Despite its importance, it's a pretty simple function. It checks to make sure there is actually a job to run, calls
generate_analysis() (see below) to use the info in the
job_structure attribute to actually create a
DTAnalysis object and its
DTOutput sub-objects, and then passes control off to the
run() method of the
DTAnalysis object to do the heavy lifting.
Oh, and if there are multiple jobs in the
job_structure attribute (which should be a list in general, but it could be a list of one item),
run() loops through those in order.
This function also handles the
KeyboardInterrupt that is generated when the user presses Ctrl-C by printing a message onscreen and proceeding to the next job in the list (if any). So if your analysis isn't going well and you want to terminate it early, feel free to do so without fear of also killing any additional jobs that might be after it in the queue. (Just don't hold down Ctrl-C too long... we're not sure if that generates multiple
KeyboardInterrupts, because we're too scared to try, and it might be system-specific anyway. But if you do hold it down and it kills multiple jobs as a result, don't say we didn't warn you.)
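The behavior described above boils down to a try/except around each job in the loop; schematically (a sketch, not the actual run() implementation):

```python
def run_all_jobs(job_structure, run_one_job):
    # Loop over the jobs in order; a Ctrl-C (KeyboardInterrupt) during
    # one job skips to the next job instead of killing the whole queue.
    results = []
    for job in job_structure:
        try:
            results.append(run_one_job(job))
        except KeyboardInterrupt:
            print("Job interrupted by user; moving on to the next job.")
    return results
```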
(returns a fully-populated DTAnalysis object, ready to be run) Takes in a single dictionary (e.g., one element of the list in the
job_structure attribute) describing the (hyper)parameters of an analysis and uses it to create a living, breathing
DTAnalysis object, along with the
DTOutput objects that live inside its little kangaroo pouch.
Mostly what it does is pretty simple; it breaks out the values for the main top-level keys in the job structure, output among them (for more details, see the documentation on the format of the JSON job files), and uses those to instantiate the corresponding objects.
In the process, it also fills in a few attributes of those objects that the
DTJob object knows about (e.g. it passes along the
job_file_hash attribute to both the DTAnalysis and DTOutput objects, for purposes of naming temp files and output files). And, it sets up associations between the DTAnalysis and DTOutput objects (i.e., it assigns the
DTOutput objects to be attributes of the
DTAnalysis object, but also sets up the reference that each
DTOutput object retains back to its parent DTAnalysis object).
If you are writing your own code using this module, it is conceivable that you might use this function (e.g., if you want to generate a
DTAnalysis from a job structure and then go off and do your own thing with it), but in typical usage, you'd probably just call
run() (see above) and let that take care of everything.
(returns True or False) Checks the format of a job structure (or list of job structures); returns
True if all is well and
False if not. Mainly does so by looping through the structure and calling
validate_job_structure_onedict() (see below) on each dictionary. It's worth noting that this is not an incredibly comprehensive check; it won't necessarily catch all errors that could crop up when you actually run an analysis, it just confirms that the job structure is not SO broken that it can't even create the basic objects (such as the
DTOutput objects) that comprise a job to run.
Note that if nothing is passed in for the
struct_to_validate argument, the default is to use the
job_structure attribute (see Attributes section for details). In practice this flexibility is a little pointless, since the one place where
validate_job_structure() is used in the toolbox code is to check potential candidates for the
job_structure attribute. However, it does enable the user to use this method to validate arbitrary job structures if they really desire; why you might want to do this outside of the normal use cases of the
DTJob functionality is unclear to us, but we aren't here to judge your bizarre life choices. The only catch is that because this is an instance method rather than a static method, you would have to create a blank
DTJob object in order to be able to use the method.
(returns True or False) Checks the format of a single dictionary specifying a job structure, returning
True if all is well and
False if not. A warning message is also printed giving a little bit of detail if the check fails. Users would typically not need to call this sub-function directly; they could just call
validate_job_structure() and let it take care of the details.
This check is pretty basic and essentially proceeds in two steps. First, it extracts the values for the four main top-level keys in the job structure (including
output). A valid job structure has to have SOMETHING in place for all four of those elements, so if any of them are missing, the validation check fails.
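That first step is a plain presence check over the required top-level keys; schematically (the key names in this sketch are placeholders -- consult the JSON config file documentation for the actual required keys):

```python
REQUIRED_KEYS = ("data", "analysis", "model", "output")  # placeholder names

def has_required_keys(job):
    # The check fails if any required top-level key is missing.
    missing = [k for k in REQUIRED_KEYS if k not in job]
    if missing:
        print("Warning: job structure is missing keys:", missing)
        return False
    return True
```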
If all four of those elements exist, we go on to call the
validate_arguments() method within each of the corresponding classes (DTOutput among them), passing them the relevant arguments from the job structure. (See those modules for details, but in short: Right now validation is very basic for all of them and essentially consists of trying to instantiate an object for each of them with the arguments provided. If an error is encountered, it is presumed to be due to bad arguments and the check fails. More sophisticated/smart argument checking may come in the future, but this basic version works OK for now.) If any one of those
validate_arguments() checks fails, the overall validation check fails. If all four of them pass, then we're all good and return True.
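That try-to-instantiate validation strategy looks roughly like this generic sketch (the class and function names here are made up; the real classes and signatures live in the toolbox modules):

```python
def arguments_look_valid(cls, **kwargs):
    # If constructing the object raises any error, presume the
    # arguments were bad and report the check as failed.
    try:
        cls(**kwargs)
        return True
    except Exception:
        return False
```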
(returns a converted job structure) There is almost no chance anyone will ever want/need to run this method, since it is almost entirely vestigial at this point; however, we will document it for posterity and for the sake of convincing anyone who comes across it that they can safely ignore it.
Basically, in very early versions of this toolbox,
DTOutput did not exist yet and output options were all under the umbrella of the
DTAnalysis object. Before long we realized we had enough output options and functionality to warrant a dedicated output class, and
DTOutput was born.
This method just takes job structures from job files written in the old style and migrates the output options to the new style. All of our examples should be in the new style, so no one except the devs should ever encounter the old style, and even we have mostly updated all of our ancient-est job files to be in the newer format.
So... yeah. TL;DR, don't worry about this method. [Jedi mind trick gesture] You can go about your business, move along.