PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you to help let us know what areas need the most attention. Please see our website (http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.


DTJob (class)


The primary class that maintains information about a job (or list of jobs) to run. The most common usage of this class is probably indirect, e.g., by passing a JSON-based job configuration file to delineate.py, which basically just creates a DTJob object out of each file it is given and runs them in sequence. So if you are using delineate.py and specifying your jobs in JSON config files, you don't need to know too much about the inner workings of this class. Just see the documentation for the JSON-format job config files and delineate.py, and you should be all set.

However, if you are importing these modules manually and writing your own Python code, you may find yourself creating DTJob objects. See dt_design_looper_example.py for a basic script that does this.

The main thing to know in either case is that DTJob essentially maintains a structure of parameters (or hyperparameters, in deep learning parlance) that define a given analysis using a given dataset. Eventually, when the run() method is called, it will use those (hyper-)parameters to create DTAnalysis, DTData, DTModel, and DTOutput objects. So you can think of DTJob as kind of an umbrella container for those objects, although in the current implementation it doesn't hang on to them for any length of time -- those objects are just created long enough to run the analysis, and then destroyed. Another way to think of DTJob objects is as the thawed, usable form of the information frozen in JSON-based job config files.



Attributes

job_file (string): A filename or path to a JSON-format job configuration file that will be used to create the rest of the DTJob object. This is the only attribute that can be specified when creating a new DTJob object via the __init__() method; if that's the way you are going to create your DTJob object (which is essentially what delineate.py does), then you won't have to fiddle with any of the other attributes. In this case, all you'd have to do is something like:

import DTJob

my_job = DTJob.DTJob( 'my_job_file.json' )

my_job.run()

Although if that's all you want to do, you could probably just run delineate.py and pass it my_job_file.json in the first place.

If you are manually creating your DTJob object, then this attribute is not mandatory to specify; in __init__(), it defaults to None. So you can also just do:

import DTJob

my_job = DTJob.DTJob()

... and then fill in the rest of the attributes later, before you run the job.

Another option would be to load a basic job configuration in via a JSON file, but then tweak the job structure manually before you actually run. This is fundamentally not too different from specifying the whole job structure yourself in code, but it might save you a few lines of code if you typically use a very similar configuration (which can be saved in the JSON file and loaded in to create a template DTJob object) and just want to tweak a few things in your Python code.
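
For instance, something like the following would work (the filename and the inner parameter key here are just hypothetical placeholders; see the JSON config file documentation for the actual options):

import DTJob

# Load a template job from a JSON file (hypothetical filename)...
my_job = DTJob.DTJob( 'my_template_job.json' )

# ...then tweak one (hypothetical) hyperparameter in the first job's analysis options...
my_job.job_structure[0]['analysis']['some_option'] = 'some_new_value'

# ...and run as usual.
my_job.run()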

Note that job_file is just going to be passed along to Python's regular open() function eventually, so it can be a full path or a bare filename or whatever, just as long as it is accessible from wherever you're running.

job_structure (list of dictionaries): This is the actual data structure representing what is going to get run (and with what data), either specified in job_file or in code that you write.

It follows the same basic format as the JSON-format job config files, so for a breakdown of all the stuff that can go into this data structure, see the documentation for those files.

Essentially, though, job_structure is a list of Python dictionaries. (However, if you make it a single Python dictionary, DTJob should generally be nice and turn it into a single-item list as necessary... but you should probably think of it and use it as a list of dictionaries to avoid problems down the line.) Each dictionary contains the information needed to run one job. The JSON job config files can contain either a single job or a list of jobs, so loading the contents of one job file into one job_structure maintains a 1:1 correspondence between actual job files and DTJob objects.

Each dictionary contains four main keys: data, analysis, model, and output. The values corresponding to those keys will in turn be other dictionaries; each of these dictionaries contains the (hyper-)parameters necessary to instantiate a DTData, a DTAnalysis, a DTModel, and a DTOutput object, respectively. Again, see the docs for the config files for more details on what should go into those.
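
To make the shape of this structure a little more concrete, here is a minimal sketch in Python (the four top-level keys are real, but we have left the inner dictionaries empty; in real life they would contain the options described in the JSON config file documentation):

my_structure = [
    {
        'data': {},      # (hyper-)parameters used to create the DTData object
        'analysis': {},  # (hyper-)parameters used to create the DTAnalysis object
        'model': {},     # (hyper-)parameters used to create the DTModel object
        'output': {},    # output options used to create the DTOutput object
    },
    # Additional dictionaries could go here if you want one DTJob to run several jobs in sequence.
]

A structure like this could be assigned directly to the job_structure attribute of a blank DTJob object before calling run().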

As suggested above, see dt_design_looper_example.py for some ideas as to how you might go about creating this structure manually in Python code.

Also, as noted above in the description of the job_file attribute, if you wanted to initially create a DTJob object using an actual JSON job config file and then tweak the job_structure attribute manually before running, that would be perfectly allowable. The job_file parameter is automatically converted into a job_structure if you pass in a job file when creating a brand-new DTJob object, so you can assume that if you create a DTJob object from a file, the job_structure will exist when the object is done being initialized. Alternatively, if you add in a job_file attribute after creating the DTJob object for some reason, you will need to call the reload() method (see below) to turn that file into a job_structure. In any case, whatever you want to do to modify the job_structure before calling the run() method is totally up to you.

last_loaded_job_file (string): Normally this attribute would not need to be set or accessed by the user. It is set whenever the reload() method (see below) is run, so that we have a record of what job file was last loaded in.

"But, all-knowing documentation writers," you say. "Why do you even need this? Isn't it just the same as the job_file attribute then?" We chuckle knowingly. "Yes, my child," we condescendingly reply. "But suppose some user, less forward thinking than you or we, changes the job_file attribute on the fly and then calls the run() method. Presumably, they would be expecting the new job_file to get run, but instead the job_structure would be reflecting an earlier job file. So we save the name of the last job file that we know we loaded, so that run() can make sure that no one has changed out the job_file on us, and warn the user if they have."

"Yeah, yeah, that makes sense... but just one more thing," you say, suddenly sounding a lot more like Columbo. "Couldn't you do the same thing without needing another attribute? Either by using a setter function that updates job_structure every time job_file is changed, or by checking the actual contents of the job_structure attribute against the current job_file when run() is called?" We turn around, pause, and nod wearily. "Yes, we could have," we say. "But those sounded like more work than just doing this, and also we just thought of those other options now, while writing this documentation, so it's going to stay that way for a while. And, at least this way, the user gets a warning that changing job_file on the fly without explicitly then reloading the job_structure using the reload() method is kind of a weird thing to do."

job_file_hash (string): Normally, the user will not have to set or access this attribute either. It is currently used in two places: By DTModel, to determine the name of a tempfile that will be created during Keras analyses, and by DTOutput, to provide a default output filename if the user has not specified their own. If you really care, you can see the documentation for those classes for a bit more detail. But essentially, if the user loads a job_file (see above), this gets a hashed version of the contents of that file; if no job file is provided, it gets a generic default value.

The only occasion we can envision where you might want to think about this attribute is when two analyses are running concurrently in the same place in the filesystem. If they are using two different job files, then the job_file_hash will be different for them, and there is no danger of collision in either the Keras tempfiles or the output (assuming the user was foolish enough not to provide an output filename). But if they are both using the same job file (which would be a weird thing to do), or if the user is writing their own code rather than using JSON job files and has left the job_file attribute blank, there is a danger of files having the same name and thus getting overwritten. In the first case, the solution should probably just be to not be weird and to use two different job files with different output filenames explicitly specified, which will ensure that the tempfiles get unique names also. In the second case, you may want to specify job_file_hash manually and give it either a random value or some meaningful value that is guaranteed to be unique to the instance of the analysis that is currently running.
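
In that second scenario, one simple (hypothetical) approach would be to generate a random value yourself, e.g.:

import uuid

import DTJob

my_job = DTJob.DTJob()
# ... fill in my_job.job_structure here ...
my_job.job_file_hash = uuid.uuid4().hex   # random hex string, effectively guaranteed to be unique
my_job.run()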



Methods

Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.


__init__( self, job_file=None )

(no return value) Initializer function for creating a new DTJob object. Pretty much just assigns all the object's attributes; at present, the only one you can explicitly specify at init time is job_file.

When the initializer runs, it mainly just calls the reload() method (see below); if a job_file was passed in, reload() should then populate the other attributes listed in the Attributes section. If no job file was passed in, then most of the attributes just stay None and job_file_hash gets a generic default string as detailed in its entry above.


reload( self )

(no return value) Function to load the contents of a JSON job file (namely, the one specified in the job_file attribute) into the job_structure attribute. Automatically called during initialization. If users are writing their own code using this module, they should also call reload() if they manually change the job_file attribute and want the file to actually get loaded in.

Along the way it does some (fairly basic) validation of the specified file... first in just making sure it exists and has valid JSON syntax. If that's all good, it reads in the data structure from the JSON file and then calls validate_job_structure() (see below) for additional checking.

After the job file is read in, this function also updates the job_file_hash attribute with an MD5 hash of the file's contents; see the corresponding entry in the Attributes section above for more info on what that hash is used for.
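
So, if you are swapping job files on an existing DTJob object in your own code, the pattern would look roughly like this (the filenames are just placeholders):

import DTJob

my_job = DTJob.DTJob( 'first_job.json' )
my_job.run()

my_job.job_file = 'second_job.json'
my_job.reload()   # actually loads the new file into job_structure (and updates job_file_hash)
my_job.run()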


run( self )

(no return value) This is a pretty big function at the heart of the entire toolbox; it runs the analysis. If you have used the delineate.py script, you may have noticed that it is basically just a loop of creating DTJob objects and then calling run() on them. If you are writing your own code using this module, you probably will call run() at some point.

Despite its importance, it's a pretty simple function. It checks to make sure there is actually a job to run, calls generate_analysis() (see below) to use the info in the job_structure attribute to actually create a DTAnalysis object and its DTData, DTModel, and DTOutput sub-objects, and then passes control off to the run() method of the DTAnalysis object to do the heavy lifting.

Oh, and if there are multiple jobs in the job_structure attribute (which should be a list in general, but it could be a list of one item), run() loops through those in order.

This function also handles the KeyboardInterrupt that is generated when the user presses Ctrl-C by printing a message onscreen and proceeding to the next job in the list (if any). So if your analysis isn't going well and you want to terminate it early, feel free to do so without fear of also killing any additional jobs that might be after it in the queue. (Just don't hold down Ctrl-C too long... we're not sure if that generates multiple KeyboardInterrupts because we're too scared to try, and it might be system-specific anyway? But if you do hold it down and it kills multiple jobs as a result, don't say we didn't warn you.)
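
And for reference, if you wanted to roll your own bare-bones equivalent of delineate.py, it could be as simple as something like this (a sketch, not the actual delineate.py source):

import sys

import DTJob

# Create a DTJob from each config file given on the command line and run them in sequence;
# each DTJob's run() method handles looping over any multiple jobs within a single file.
for job_filename in sys.argv[1:]:
    my_job = DTJob.DTJob( job_filename )
    my_job.run()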


generate_analysis( self, job_item=None )

(returns a fully-populated DTAnalysis object, ready to be run) Takes in a single dictionary (e.g., one element of the list in the job_structure attribute) describing the (hyper)parameters of an analysis and uses it to create a living, breathing DTAnalysis object, along with the DTData, DTModel, and DTOutput objects that live inside its little kangaroo pouch.

Mostly what it does is pretty simple; it breaks out the values for the analysis, data, model, and output keys in the job structure (for more details, see the documentation on the format of the JSON job files) and uses those to instantiate the DTAnalysis, DTData, DTModel, and DTOutput objects.

In the process, it also fills in a few attributes of those objects that the DTJob object knows about (e.g. it passes along the job_file_hash attribute to both the DTModel and DTOutput objects, for purposes of naming temp files and output files). And, it sets up associations among the DTAnalysis, DTData, DTModel, and DTOutput objects (i.e., it assigns the DTData, DTModel, and DTOutput objects to be attributes of the DTAnalysis object, but also sets up the reference that the DTOutput object retains back to its parent DTAnalysis object).

If you are writing your own code using this module, it is conceivable that you might use this function (e.g., if you want to generate a DTAnalysis from a job structure and then go off and do your own thing with it), but in typical usage, you'd probably just call run() (see above) and let that take care of everything.
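
A quick sketch of what that might look like (again assuming a hypothetical job file name):

import DTJob

my_job = DTJob.DTJob( 'my_job_file.json' )

# Build the DTAnalysis (and its DTData, DTModel, and DTOutput sub-objects) for the
# first job in the structure without running it right away...
my_analysis = my_job.generate_analysis( my_job.job_structure[0] )

# ... do whatever custom inspection/tweaking you want here ...

# ...and then run it yourself when you're ready.
my_analysis.run()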


validate_job_structure( self, struct_to_validate=None )

(returns True or False) Checks the format of a job structure (or list of job structures); returns True if all is well and False if not. Mainly does so by looping through validate_job_structure_onedict() (see below). It's worth noting that this is not an incredibly comprehensive check: it won't necessarily catch all errors that could crop up when you actually run an analysis; it just confirms that the job structure is not SO broken that it can't even create the basic DTAnalysis, DTData, DTModel, and DTOutput objects that comprise a job to run.

Note that if nothing is passed in for the struct_to_validate argument, the default is to use the job_structure attribute (see Attributes section for details). In practice this flexibility is a little pointless, since the one place where validate_job_structure() is used in the toolbox code is to check potential candidates for the job_structure attribute. However, it does enable the user to use this method to validate arbitrary job structures if they really desire; why you might want to do this outside of the normal use cases of the DTJob functionality is unclear to us, but we aren't here to judge your bizarre life choices. The only catch is that because this is not a class method, you would have to create a blank DTJob object in order to be able to use the method.
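
If you do find yourself wanting to do that, the pattern would look something like this (the structure being validated here is just a toy placeholder):

import DTJob

my_questionable_structure = [ { 'data': {}, 'analysis': {}, 'model': {}, 'output': {} } ]

validator = DTJob.DTJob()   # blank DTJob object, created just so we can call the method

if validator.validate_job_structure( my_questionable_structure ):
    print( 'Job structure looks at least superficially runnable.' )
else:
    print( 'Job structure failed validation.' )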


validate_job_structure_onedict( self, dict_to_validate )

(returns True or False) Checks the format of a single dictionary specifying a job structure, returning True if all is well and False if not. A warning message is also printed giving a little bit of detail if the check fails. Users would typically not need to call this sub-function directly; they could just call validate_job_structure() and let it take care of the details.

This check is pretty basic and essentially proceeds in two steps. First, it extracts the values for the analysis, data, model, and output keys in the job structure. A valid job structure has to have SOMETHING in place for all four of those elements, so if any of them are missing, the validation check fails.

If all four of those elements exist, we go on to call the validate_arguments() method within each of the DTAnalysis, DTData, DTModel, and DTOutput classes, passing them the relevant arguments from the job structure. (See those modules for details, but in short: Right now validation is very basic for all of them and essentially consists of trying to instantiate an object for each of them with the arguments provided. If an error is encountered, it is presumed to be due to bad arguments and the check fails. More sophisticated/smart argument checking may come in the future, but this basic version works OK for now.) If any one of those validate_arguments() checks fails, the overall validation check fails. If all four of them pass, then we're all good and return True.


convert_old_json_output_options_to_new( self, job_struct )

(returns a converted job structure) There is almost no chance anyone will ever want/need to run this method, since it is almost entirely vestigial at this point; however, we will document it for posterity and for the sake of convincing anyone who comes across it that they can safely ignore it.

Basically, in very early versions of this toolbox, DTOutput did not exist yet and output options were all under the umbrella of the DTAnalysis object. Before long we realized we had enough output options and functionality to warrant a dedicated output class, and DTOutput was born.

This method just takes job structures from job files written in the old style and migrates the output options to the new style. All of our examples should be in the new style, so no one except the devs should ever encounter the old style, and even we have mostly updated all of our ancient-est job files to be in the newer format.

So... yeah. TL;DR, don't worry about this method. [Jedi mind trick gesture] You can go about your business, move along.