PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you to let us know what areas need the most attention. Please see our website (http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.
The primary class that maintains information about a job (or list of jobs) to run. The most common usage of this class would probably be indirect, e.g., by passing a JSON-based job configuration file to delineate.py, which basically just creates DTJob objects out of each file it is given and runs them in sequence. So if you are using delineate.py and specifying your jobs in JSON config files, you don't need to know too much about the inner workings of this class. Just see the documentation for the JSON-format job config files and delineate.py, and you should be all set.
However, if you are importing these modules manually and writing your own Python code, you may find yourself creating DTJob objects. See dt_design_looper_example.py for a basic script that does this.
The main thing to know in either case is that DTJob essentially maintains a structure of parameters (or hyperparameters, in deep learning parlance) that define a given analysis using a given dataset. Eventually, when the run() method is called, it will use those (hyper-)parameters to create DTAnalysis, DTData, and DTModel objects. So you can think of DTJob as kind of an umbrella container for those objects, although in the current implementation it doesn't hang on to them for any length of time -- those objects are just created long enough to run the analysis, and then destroyed. Another way to think of DTJob objects is as the thawed, usable form of the information frozen in JSON-based job config files.
job_file (string): A filename or path to a JSON-format job configuration file that will be used to create the rest of the DTJob object. This is the only attribute that can be specified when creating a new DTJob object via the __init__() method; if that's the way you are going to create your DTJob object (which is essentially what delineate.py does), then you won't have to fiddle with any of the other attributes. In this case, all you'd have to do is something like:
import DTJob
my_job = DTJob.DTJob( 'my_job_file.json' )
my_job.run()
Although if that's all you want to do, you could probably just run delineate.py and pass it my_job_file.json in the first place.
If you are manually creating your DTJob object, then this attribute is not mandatory to specify; in __init__(), it defaults to None. So you can also just do:
import DTJob
my_job = DTJob.DTJob()
... and then fill in the rest of the attributes later, before you run the job.
Another option would be to load a basic job configuration in via a JSON file, but then tweak the job structure manually before you actually run. This is fundamentally not too different from specifying the whole job structure yourself in code, but it might save you a few lines of code if you typically use a very similar configuration (which can be saved in the JSON file and loaded in to create a template DTJob object) and just want to tweak a few things in your Python code.
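For example, a minimal sketch of that load-then-tweak pattern might look something like the following (the particular parameter being tweaked here, 'learning_rate' under the 'model' key, is just a hypothetical placeholder; see the JSON job config file docs for the actual options):
import DTJob
# Load a template job from an existing JSON config file...
my_job = DTJob.DTJob( 'template_job_file.json' )
# ...then tweak the loaded structure before running. ('learning_rate' is illustrative only.)
my_job.job_structure[0]['model']['learning_rate'] = 0.001
my_job.run()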
Note that job_file is just going to be passed along to Python's regular open() function eventually, so it can be a full path or a bare filename or whatever, just as long as it is accessible from wherever you're running.
job_structure (list of dictionaries): This is the actual data structure representing what is going to get run (and with what data), either specified in job_file or in code that you write. It follows the same basic format as the JSON-format job config files, so for a breakdown of all the stuff that can go into this data structure, see the documentation for those files.
Essentially, though, job_structure is a list of Python dictionaries. (However, if you make it a single Python dictionary, DTJob should generally be nice and turn it into a single-item list as necessary... but you should probably think of it and use it as a list of dictionaries to avoid problems down the line.) Each dictionary contains the information needed to run one job. The JSON job config files can contain either a single job or a list of jobs, so loading the contents of one job file into one job_structure maintains a 1:1 correspondence between actual job files and DTJob objects.
Each dictionary contains four main keys: data, analysis, model, and output. The values corresponding to those keys will in turn be other dictionaries; each of these dictionaries contains the (hyper-)parameters necessary to instantiate a DTData, a DTAnalysis, a DTModel, and a DTOutput object, respectively. Again, see the docs for the config files for more details on what should go into those.
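As a rough illustration (assuming my_job is a DTJob object, as in the earlier examples; the specific parameter names inside each sub-dictionary are hypothetical placeholders, so see the JSON job config file documentation for the real options), a job_structure might look something like:
my_job.job_structure = [
    {
        'data':     { 'dataset_file': 'my_dataset.csv' },    # hypothetical DTData parameters
        'analysis': { 'analysis_type': 'classification' },   # hypothetical DTAnalysis parameters
        'model':    { 'model_type': 'keras_mlp' },           # hypothetical DTModel parameters
        'output':   { 'output_file': 'my_results.csv' }      # hypothetical DTOutput parameters
    }
]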
As suggested above, see dt_design_looper_example.py for some ideas as to how you might go about creating this structure manually in Python code.
Also, as noted above in the description of the job_file attribute, if you wanted to initially create a DTJob object using an actual JSON job config file, then tweak the job_structure attribute manually before running, that would be perfectly allowable. The job_file parameter is automatically converted into a job_structure if you pass in a job file when creating a brand-new DTJob object, so if you create a DTJob object from a file, you can assume the job_structure will exist when the object is done being initialized. Alternatively, if you add in a job_file attribute after creating the DTJob object for some reason, you will need to call the reload() method (see below) to turn that file into a job_structure. In any case, whatever you want to do to modify the job_structure before calling the run() method is totally up to you.
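For instance, a minimal sketch of that set-the-file-later pattern (assuming my_job_file.json is a valid job config file accessible from wherever you're running):
import DTJob
my_job = DTJob.DTJob()
my_job.job_file = 'my_job_file.json'
my_job.reload()   # needed to actually turn the file into a job_structure
my_job.run()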
last_loaded_job_file (string): Normally this attribute would not need to be set or accessed by the user. It is set whenever the reload() method (see below) is run, so that we have a record of what job file was last loaded in.
"But, all-knowing documentation writers," you say. "Why do you even need this? Isn't it just the same as the job_file
attribute then?" We chuckle knowingly. "Yes, my child," we condescendingly reply. "But suppose some user, less forward thinking than you or we, changes the job_file
attribute on the fly and then calls the run()
method. Presumably, they would be expecting the new job_file
to get run, but instead the job_structure
would be reflecting an earlier job file. So we save the name of the last job file that we know we loaded, so that run()
can make sure that no one has changed out the job_file
on us, and warn the user if they have."
"Yeah, yeah, that makes sense... but just one more thing," you say, suddenly sounding a lot more like Columbo. "Couldn't you do the same thing without needing another attribute? Either by using a setter function that updates job_structure
every time job_file
is changed, or by checking the actual contents of the job_structure
attribute against the current job_file
when run()
is called?" We turn around, pause, and nod wearily. "Yes, we could have," we say. "But those sounded like more work than just doing this, and also we just thought of those other options now, while writing this documentation, so it's going to stay that way for a while. And, at least this way, the user gets a warning that changing job_file
on the fly without explicitly then reloading the job_structure
using the reload()
method is kind of a weird thing to do."
job_file_hash (string): Normally, the user will not have to set or access this attribute either. It is currently used in two places: by DTModel, to determine the name of a tempfile that will be created during Keras analyses, and by DTOutput, to provide a default output filename if the user has not specified their own. If you really care, you can see the documentation for those classes for a bit more detail. But essentially, if the user loads a job_file (see above), this gets a hashed version of the contents of that file; if no job file is provided, it gets a generic default value.
The only occasion we can envision where you might want to think about this attribute is when two analyses are running concurrently in the same place in the filesystem. If they are using two different job files, then the job_file_hash will be different for them, and there is no danger of collision in either the Keras tempfiles or the output (assuming the user was foolish enough not to provide an output filename). But if they are both using the same job file (which would be a weird thing to do), or if the user is writing their own code rather than using JSON job files and has left the job_file attribute blank, there is the danger of files having the same name and thus getting overwritten. In the first case, the solution should probably be just to not be weird and use two different job files with different output filenames explicitly specified, which will ensure that the tempfiles get unique names also. In the second case, you may want to specify job_file_hash manually and give it either a random value or some meaningful value that is guaranteed to be unique to the instance of the analysis that is currently running.
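As a quick sketch of that second case (using uuid here is just one way to get a value that is effectively guaranteed to be unique; any sufficiently unique string will do):
import uuid
import DTJob
my_job = DTJob.DTJob()
# ... fill in my_job.job_structure here ...
my_job.job_file_hash = uuid.uuid4().hex   # unique value, so tempfile/output names won't collide
my_job.run()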
Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.
__init__( self, job_file=None )
(no return value) Initializer function for creating a new DTJob object. Pretty much just assigns all the object's attributes, and currently the only one you can explicitly specify at init time is job_file.
When the initializer runs, it mainly just calls the reload() method (see below); if a job_file was passed in, reload() should then populate the other attributes listed in the Attributes section. If no job file was passed in, then most of the attributes just stay None and job_file_hash gets a generic default string as detailed in its entry above.
reload( self )
(no return value) Function to load the contents of a JSON job file (namely, the one specified in the job_file attribute) into the job_structure attribute. Automatically called during initialization. If users are writing their own code using this module, they should also call reload() if they manually change the job_file attribute and want the file to actually get loaded in.
Along the way it does some (fairly basic) validation of the specified file... first just making sure it exists and has valid JSON syntax. If that's all good, it reads in the data structure from the JSON file and then calls validate_job_structure() (see below) for additional checking.
After the job file is read in, this function also updates the job_file_hash attribute with an MD5 hash of the file's contents; see the corresponding entry in the Attributes section above for more info on what that hash is used for.
run( self )
(no return value) This is a pretty big function at the heart of the entire toolbox; it runs the analysis. If you have used the delineate.py script, you may have noticed that it is basically just a loop of creating DTJob objects and then calling run() on them. If you are writing your own code using this module, you probably will call run() at some point.
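In other words, something along these lines (a rough sketch of the delineate.py-style loop, with made-up filenames):
import DTJob
# One DTJob per job file, run in sequence -- roughly what delineate.py does.
for job_filename in [ 'job_one.json', 'job_two.json' ]:
    this_job = DTJob.DTJob( job_filename )
    this_job.run()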
Despite its importance, it's a pretty simple function. It checks to make sure there is actually a job to run, calls generate_analysis() (see below) to use the info in the job_structure attribute to actually create a DTAnalysis object and its DTData, DTModel, and DTOutput sub-objects, and then passes control off to the run() method of the DTAnalysis object to do the heavy lifting.
Oh, and if there are multiple jobs in the job_structure attribute (which should be a list in general, but it could be a list of one item), run() loops through those in order.
This function also handles the KeyboardInterrupt that is generated when the user presses Ctrl-C by printing a message onscreen and proceeding to the next job in the list (if any). So if your analysis isn't going well and you want to terminate it early, feel free to do so without fear of also killing any additional jobs that might be after it in the queue. (Just don't hold down Ctrl-C too long... we're not sure if that generates multiple KeyboardInterrupts because we're too scared to try, and it might be system-specific anyway. But if you do hold it down and it kills multiple jobs as a result, don't say we didn't warn you.)
generate_analysis( self, job_item=None )
(returns a fully-populated DTAnalysis object, ready to be run) Takes in a single dictionary (e.g., one element of the list in the job_structure attribute) describing the (hyper-)parameters of an analysis and uses it to create a living, breathing DTAnalysis object, along with the DTData, DTModel, and DTOutput objects that live inside its little kangaroo pouch.
Mostly what it does is pretty simple; it breaks out the values for the analysis, data, model, and output keys in the job structure (for more details, see the documentation on the format of the JSON job files) and uses those to instantiate the DTAnalysis, DTData, DTModel, and DTOutput objects.
In the process, it also fills in a few attributes of those objects that the DTJob object knows about (e.g., it passes along the job_file_hash attribute to both the DTModel and DTOutput objects, for purposes of naming temp files and output files). And, it sets up associations among the DTAnalysis, DTData, DTModel, and DTOutput objects (i.e., it assigns the DTData, DTModel, and DTOutput objects to be attributes of the DTAnalysis object, but also sets up the reference that the DTOutput object retains back to its parent DTAnalysis object).
If you are writing your own code using this module, it is conceivable that you might use this function (e.g., if you want to generate a DTAnalysis from a job structure and then go off and do your own thing with it), but in typical usage, you'd probably just call run() (see above) and let that take care of everything.
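If you did want to go that route, a rough sketch might look like this (assuming my_job already has a valid job_structure; what you then do with the returned DTAnalysis object besides calling its run() method is up to you, so see the DTAnalysis docs for what's available):
# Build the DTAnalysis (plus its DTData, DTModel, and DTOutput) for the first job in the list...
my_analysis = my_job.generate_analysis( my_job.job_structure[0] )
# ...then poke at it, tweak it, or just run it yourself.
my_analysis.run()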
validate_job_structure( self, struct_to_validate=None )
(returns True or False) Checks the format of a job structure (or list of job structures); returns True if all is well and False if not. Mainly does so by looping through validate_job_structure_onedict() (see below). It's worth noting that this is not an incredibly comprehensive check; it won't necessarily catch all errors that could crop up when you actually run an analysis, it just confirms that the job structure is not SO broken that it can't even create the basic DTAnalysis, DTData, DTModel, and DTOutput objects that comprise a job to run.
Note that if nothing is passed in for the struct_to_validate argument, the default is to use the job_structure attribute (see the Attributes section for details). In practice this flexibility is a little pointless, since the one place where validate_job_structure() is used in the toolbox code is to check potential candidates for the job_structure attribute. However, it does enable the user to use this method to validate arbitrary job structures if they really desire; why you might want to do this outside of the normal use cases of the DTJob functionality is unclear to us, but we aren't here to judge your bizarre life choices. The only catch is that because this is not a class method, you would have to create a blank DTJob object in order to be able to use the method.
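If you do go down that road, here is a minimal sketch (the empty sub-dictionaries are placeholders; in practice they would need whatever arguments DTData, DTAnalysis, DTModel, and DTOutput actually expect):
import DTJob
# The blank DTJob object exists only so we can call its validation method.
checker = DTJob.DTJob()
some_structure = [ { 'data': {}, 'analysis': {}, 'model': {}, 'output': {} } ]
if checker.validate_job_structure( some_structure ):
    print( 'Structure looks OK (keeping in mind this is not a comprehensive check).' )
else:
    print( 'Something is missing or badly formed.' )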
validate_job_structure_onedict( self, dict_to_validate )
(returns True or False) Checks the format of a single dictionary specifying a job structure, returning True if all is well and False if not. A warning message is also printed giving a little bit of detail if the check fails. Users would typically not need to call this sub-function directly; they could just call validate_job_structure() and let it take care of the details.
This check is pretty basic and essentially proceeds in two steps. First, it extracts the values for the analysis, data, model, and output keys in the job structure. A valid job structure has to have SOMETHING in place for all four of those elements, so if any of them are missing, the validation check fails.
If all four of those elements exist, we go on to call the validate_arguments() method within each of the DTAnalysis, DTData, DTModel, and DTOutput classes, passing them the relevant arguments from the job structure. (See those modules for details, but in short: right now validation is very basic for all of them and essentially consists of trying to instantiate an object for each of them with the arguments provided. If an error is encountered, it is presumed to be due to bad arguments and the check fails. More sophisticated/smart argument checking may come in the future, but this basic version works OK for now.) If any one of those validate_arguments() checks fails, the overall validation check fails. If all four of them pass, then we're all good and return True.
convert_old_json_output_options_to_new( self, job_struct )
(returns a converted job structure) There is almost no chance anyone will ever want/need to run this method, since it is almost entirely vestigial at this point; however, we will document it for posterity and for the sake of convincing anyone who comes across it that they can safely ignore it.
Basically, in very early versions of this toolbox, DTOutput did not exist yet and output options were all under the umbrella of the DTAnalysis object. Before long we realized we had enough output options and functionality to warrant a dedicated output class, and DTOutput was born.
This method just takes job structures from job files written in the old style and migrates the output options to the new style. All of our examples should be in the new style, so no one except the devs should ever encounter the old style, and even we have mostly updated all of our ancient-est job files to be in the newer format.
So... yeah. TL;DR, don't worry about this method. [Jedi mind trick gesture] You can go about your business, move along.