PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you about which areas need the most attention. Please see our website (http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.


DTAnalysis (class)


This is essentially the "master" object class for any analysis run using this toolbox. A DTAnalysis object contains within it a DTData object, a DTModel object, and a DTOutput object, and it handles most of the basic logic of actually carrying out a given analysis.

The DTAnalysis object is typically created by DTJob from a specification contained in a JSON-format job file, although you can also create one manually in code if you know what you're doing.



Attributes

nits (integer, USUALLY): An integer indicating the number of iterations of the analysis to run; an iteration typically means one complete run of the analysis with a given set of parameters and a particular randomization of training, validation (if necessary), and test sub-datasets. However, the exact meaning may depend on the cross-validation scheme you're using. (See xval_type, nfolds, classify_over, xval_sa, etc., below for more details.)

Note that we say nits is USUALLY an integer. By this we mean that it should be an integer for all currently fully-implemented cross-validation schemes. For transfer learning, a scheme that is currently implemented experimentally but not recommended for anyone outside the dev team to use, nits would be a two-item list of integers, with the first item indicating the number of iterations in the "outer" loop and the second indicating the number of iterations in the "inner" loop. (More on this when/if the transfer module goes public -- stay tuned!)

xval_type (string): String code describing the type of cross-validation scheme to use. Currently implemented options (be warned, we did not use the most descriptive names ever) include:

  • single: Do a single training, validation (if necessary), and test split, train a model, and test it on the test set. This is a pretty common scheme to use if you have a large number of samples (aka trials) and you just want to divvy them up randomly for classification (as opposed to doing classification within subjects or something like that). Good for things like image processing, or for EEG/MEG type analyses where you have a bunch of features (e.g. channels/timepoints) that are essentially the same across subjects, and you want to train/test a "universal" classification model (i.e., one that is trained across trials pulled from all subjects, not trained/tested within subjects). In the accuracy summary output of this cross-validation scheme, you will essentially get just one accuracy value per iteration.

  • loop_over_sa: Do classification separately for each value of a specified "sample attribute." Data samples (aka trials) can be tagged with an arbitrary number of sample attributes, or SA's for short. This method will loop over each unique value of the SA you specify (in the xval_sa attribute, see below) and run your analysis for each one separately. For neuroimaging data, a common use of this might be to have an SA called "subject" representing the human individual from which that trial was drawn, and then loop_over_sa will let you do your training/testing within subjects rather than as a "universal" model. But of course the SA doesn't have to represent subjects -- it could represent any subsetting scheme of your data over which you want to loop and run separate versions of your analysis, such as experiments or subject populations. In the accuracy summary output, unlike for the single scheme above, you'll get multiple accuracy values for each iteration, one for each unique value of your SA (e.g., one per subject), which is listed under the Fold column of the accuracy summary output file.

  • transfer_over_sa: EXPERIMENTAL, DO NOT USE unless you're on the dev team or unless Danger is your middle name. More on this when/if it gets finished/goes public.

  • Another option for xval_type that is not currently implemented, but will likely be added in the relatively near future, is kfold (or something along those lines) for k-fold cross-validation.

  • Other cross-validation schemes are in the works as well. If you have a particular one that you want implemented sooner rather than later, get in touch!

nfolds (integer): This parameter is actually not currently used but is included in the class for when we get around to implementing k-fold cross-validation. At that time it will indicate how many folds to cross-validate over. Right now it does nothing.

train_val_test (3-item list of numeric values): This is a list containing the proportions of the input data to use as training, validation, and/or test datasets, in that order. Not all theoretically possible cross-validation schemes require this parameter, but all the ones currently implemented do.

The items in this list can be either integers or floating-point values, but they have to be numbers and there have to be three of them, even if one of them is not used. (For example, analyses using the PyMVPA backend would not require any validation data, only training/test, but you would still have to specify all three values with the middle value set to 0.) The values should add up to either exactly 100 (i.e., 100 percent) or exactly 1, depending on whether you like percentages or proportions better. If they add up to anything else (within Python's ability to detect, anyway; 99.9 is not good enough but 99.999999999999993 might work), you'll get an error message.

A full discussion of how cross-validation works is beyond the scope of this documentation, but essentially, a typical MVPA (e.g., using SVMs or SMLR or something like that) requires you to first generate your model using a "training" dataset and then test it to see how generalizable it is on a held-out "test" dataset. So for an SVM-type analysis, you might specify something like [80, 0, 20] for the train_val_test parameter, which means 80% of the data will be used for training and then the model will be tested on the held-out 20%. Of course your choices here will affect the quality of your analysis output and depend on lots of factors, but generally speaking something between about a 70/30 split and about a 90/10 split is a safe choice for many datasets.

Note that these proportions may be approximated if your input data do not divide up perfectly according to what numbers you specify, so you don't have to worry about getting things exact. The toolbox will automatically get as close as it can while making sure that (1) you don't end up with a non-integer number of samples (aka trials) in any subset of the data, and (2) all of your classes have an equal number of samples/trials, so as not to bias the classification. This means that a certain number of samples/trials may go unused. (New data subsets will be selected randomly on each iteration of the analysis, so samples/trials that go unused on one iteration will probably get used on a different iteration.)

Keras-based deep learning analyses are similar but they add a "validation" dataset. The validation dataset is like another "test" dataset but used only in training. In other words, in deep learning analyses, the algorithm will typically go through multiple rounds of training the neural network and testing it against the validation dataset before deciding the network is trained up about as well as it's ever going to get. (This is assuming you use the "early stopping" feature in Keras, meaning that instead of training for a fixed number of rounds, the network trains until it hits a stopping criterion determined by performance on the validation set. For purposes of this discussion, we will assume everyone will use early stopping.) Only then will it perform the "true" test of the network's performance against the held-out test dataset. Again, exactly which proportions work best for you will depend on many factors including how much data you have, but something along the lines of 60/20/20 or 70/20/10 or 80/10/10 is probably what a lot of folks will end up using. (In Python list notation, that would be [60, 20, 20] and so on.)
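
To make those proportions a bit more concrete, here is a rough Python sketch of how a [60, 20, 20] split might translate into balanced, integer per-class sample counts. To be clear, this is just an illustration of the logic described above (and the variable names, like labels, are made up for the example), not the toolbox's actual slicing code:

    # Illustrative sketch only -- NOT the toolbox's actual slicing code.
    from collections import Counter

    train_val_test = [60, 20, 20]           # could also be [0.6, 0.2, 0.2]
    fractions = [p / sum(train_val_test) for p in train_val_test]

    labels = ["face"] * 487 + ["scene"] * 513       # hypothetical class labels, one per trial
    smallest_class = min(Counter(labels).values())  # balance classes on the rarest one

    # Per-class counts for each subset, rounded down so every subset contains a
    # whole number of samples; leftover samples simply go unused this iteration.
    counts_per_class = [int(smallest_class * f) for f in fractions]
    print(counts_per_class)   # -> [292, 97, 97]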

scaling_method (string or None): What type of rescaling operation should be done on your data. Used in conjunction with scaling (see below). If both scaling_method and scaling are None (or if you don't specify anything in your JSON configuration file for either of them, as the default for both parameters is None), then no data rescaling is performed and analyses are performed on your raw, un-transformed data values. If one of the scaling methods below is specified, then data are rescaled accordingly. Note that rescaling is performed in a way that preserves statistical independence between your training, validation, and/or test datasets, although the particular details of how that rescaling happens depend on which rescaling option you choose. Note also that how much rescaling affects your classification performance is pretty dependent on your analysis architecture and all of your other parameters... sometimes rescaling can make a big difference and other times it doesn't. (In the average case, with data values that aren't too weirdly distributed, it probably makes a small-to-medium difference.) Currently implemented options include:

  • percentile: All data are rescaled according to the Nth percentile of the absolute value of all training data points, where N is the value of the scaling parameter below.

    This is a little bit weird at first glance but makes sense in many cases. It works best for data that are either non-negative or more-or-less centered around zero (for example, EEG data where zero would mean zero voltage on the scalp relative to the reference electrode). It may not be an ideal choice for datasets that include negative values but whose values are not equally distributed on the positive and negative sides. The idea is that it lets you do something similar to normalizing your values into a pre-defined range like 0-1, but with the normalization based on a moderately-high value in your dataset rather than the very maximum value, which could be a non-representative outlier. And by "high," we mean in absolute value terms (distance from zero) -- for scaling purposes we don't really care if the most extreme values are on the positive or the negative end. Although again, things could get funky if you have both positive and negative values in your data but they aren't at least approximately equally distributed around zero.

    As an example, if you use this scaling_method value and set scaling (see below) to 80, all the values in the dataset would be divided by whatever value represents the 80th percentile of the magnitudes (absolute values) of the values in your training dataset. Note that the value calculated from the training dataset is also used to scale the validation and test datasets -- the training dataset is usually the largest and thus most statistically robust, and applying a single common scale factor keeps statistical fluctuations from scaling the three subsets differently. And since all we're doing is dividing everything by that common factor (and the characteristics of the validation/test datasets play no part in the percentile calculation), this should not cause a problem for the statistical independence of the training, validation, and test datasets.

    Note that, for boring historical reasons (basically, so some very old scripts won't break), this is the default rescaling method chosen if you specify None for scaling_method but DO put a value in for scaling. We don't exactly suggest you do that though, because it's probably better to explicitly specify scaling_method for the sake of clarity even if you do intend to use the percentile option. Just giving you the heads-up.

    Also, to reiterate -- the percentile calculation is done over all training data points, not separately for each feature (voxel or electrode/timepoint or whatever). So the assumption is that all of your data points have SOMETHING in common that makes it reasonable to scale the data across all features. If you want to rescale in a more fine-grained manner, you might have to do it yourself before reading the data into this toolbox -- currently we have no implemented way to scale the data WITHIN features, just because there are a million ways you might want to do it, many of which could require knowing something about the structure of the dataset that would break the toolbox's minimal set of assumptions about what you're going to hand it. This is really just a basic rescaling to get values into a semi-standardized range... if you want anything fancier you might have to roll your own (or request it as a feature for a future toolbox release).

  • standardize: Similar to percentile above, but converts the data to Z-scores instead of doing a percentile-based scaling. Most of what we said above for percentile applies to standardize as well -- for example, all scaling is done based on values from the training dataset, and scaling is performed across all data points (not separately for each feature). So, see percentile above for all the quirks and caveats... this one is pretty similar in everything but the actual way we rescale the data.

    Note: the scaling value (see below) is not used for this scaling method; if you specify anything, it is silently ignored.

  • map_range: Scales all training data into a given range, which by default is zero to one. In other words, for the default 0-1 range, this subtracts the minimum value in the training dataset from all training dataset values, then divides all values by the largest value in the post-subtraction dataset, to get everything into the 0-1 range. The idea is similar for other ranges, but you can also scale your data from 0 to 100 or -40 to 137 or whatever you please. To specify a minimum/maximum other than 0 and 1, make the scaling value (see below) a two-item list where both items are numeric values. The first value will be taken as the minimum of the mapped range, and the second value will specify the maximum.

    Note that, as with percentile and standardize, the mapping is done based on the values in the training dataset. So your validation and/or test datasets may end up with values outside the 0-1 range (or whatever non-default range you specify), or they may not contain the ends of the range. But the scaling factors that are applied to validation and test datasets will be the same scaling parameters that are applied to the training dataset.

  • mean_center: Like Z-scoring but no dividing by the standard deviation -- just subtracting out the mean. As with other methods, this is done across ALL data points in the training dataset, and the value calculated from the training dataset will also be applied to the validation and test datasets. (So, you will be guaranteed that the values in the training dataset will end up centered around 0, but the validation and/or test datasets may not be exactly centered around 0 themselves. Although if the distributions of your validation and test datasets are pretty similar to the distribution of your training dataset, things should come pretty close to being centered around 0.) The scaling value (see below) is ignored for this too.

scaling (single numeric value, list of 2 numeric values, or None): Specifies any numeric information needed to implement any of the scaling_method options listed above. For percentile it's a single number indicating what percentile, for standardize this value is ignored, for map_range it's a two-item list specifying the minimum and maximum values in the range, and for mean_center this value is ignored. If the value is unspecified or specified as None, semi-reasonable defaults chosen by your benevolent development overlords will be used.
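
For the curious, here is a rough conceptual sketch of the percentile and map_range logic described above, assuming NumPy arrays. This is just an illustration of the general idea -- scale factors come from the training data and are then applied unchanged to all three subsets -- not the toolbox's actual implementation:

    # Conceptual sketch of percentile and map_range scaling -- not the toolbox's actual code.
    import numpy as np

    def percentile_scale(train, val, test, scaling=80):
        # Scale factor is computed from the training data only...
        factor = np.percentile(np.abs(train), scaling)
        # ...but the SAME factor is applied to all three subsets.
        return train / factor, val / factor, test / factor

    def map_range_scale(train, val, test, scaling=(0, 1)):
        lo, hi = scaling
        train_min, train_span = train.min(), train.max() - train.min()
        def remap(x):
            return (x - train_min) / train_span * (hi - lo) + lo
        # Validation/test values may land a bit outside [lo, hi]; that's expected.
        return remap(train), remap(val), remap(test)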

classify_over (string, probably): The name of whatever "sample attribute," or SA for short (see DTData documentation for more info), you want to do your classification over. In theory this doesn't HAVE to be a string -- we don't do a lot of checking, so in theory it could be anything that Python allows you to use as a key to a dictionary -- but if you are vaguely sane, it will almost always be a string.

To briefly summarize the DTData stuff, each sample (aka trial) in your dataset can be tagged with one or more SA's. These could be things like class, or subject, or run, or really any attribute that a sample/trial could have. Each of these SAs has a name that you give it when you load in your data.

Anyway, classify_over is just the name of whatever SA you want to run your classification on. For example, say I have some data where people are looking at faces or scenes in either the left or right side of their visual field. I might have those attributes tagged with SAs named category and field (you can use whatever names you want). If I want to run a classification of faces vs scenes, I would specify category for classify_over. If I want to classify left vs right instead, I could just change classify_over to field. Easy, right?!
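
In a JSON job file, that choice boils down to a single line along the lines of the one below (shown in isolation here; see the sample job files for exactly where it goes in the overall file):

    "classify_over": "category"

Swap "category" for "field" and you're classifying left vs. right visual field instead.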

backend_options (dictionary or None): A set of options that are essentially passed straight along to whatever backend you're using, in theory. In practice, at the moment only the Keras backend has any options that use this parameter -- if you are using the PyMVPA backend then backend_options just gets ignored. And we have kind of renamed a bunch of them because they're used in different places within Keras that we're trying to abstract away from you, dear user, by processing them ourselves at appropriate junctures, so don't go looking for things named exactly like these options in the Keras source code. If you do want to look up details in the Keras documentation/code, we have tried to note what the underlying Keras versions are called in the descriptions below.

We think these more-or-less global options make more sense as part of DTAnalysis than any other class, so we typically specify them here -- although our dirty little secret is that they basically just get passed along to DTModel for most of the heavy lifting. So if you're curious about how we parse these options and send them to Keras under the hood, look at DTModel.py instead of DTAnalysis.py for the down-and-dirty.

In brief, currently implemented Keras backend options include the following:

  • monitor (string): The name of the quantity upon which to base Keras's early stopping (keras.callbacks.EarlyStopping). Our default is val_loss, which is also Keras's default. Presumably, most of the time you are going to want to just use that. If you want to use something else, you are probably some kind of deep learning expert who already knows what they're doing; however, if not, and you're curious about what other options might be out there, consult the documentation or actual code for keras.callbacks.EarlyStopping.

  • patience (integer): Patience parameter for Keras's early stopping. Our default is 50. This refers to how many rounds of model training to do after finding the previous best model before early stopping kicks in. You may need to play around with this a bit to get the speed/performance ratio you want. Larger values will give you a better chance of finding the best possible model during training, but training will take longer. If your models tend to converge and get about as good as they're going to get quickly, you can turn it down; if your models aren't converging very well and just seem to be flailing around early in training (and getting chance performance in validation), turning up the patience may help them converge better. As with the above parameter, see the keras.callbacks.EarlyStopping documentation/code if you want more details.

  • checkpoint_filename (string): The name of a temporary file that we have to write out during Keras analyses. Not that critical for you to specify unless you truly feel passionately about your tempfile names. We should generally clean up after ourselves by deleting these at the end of an analysis, but in case one of these does not get cleared away for some reason (e.g., if the process gets killed unexpectedly), you may want to know that the default is delineate_checkpoint_weights_tempfile_ plus a "hash" string calculated from the job file you're running. So if you see such a file lying around it is safe to delete (assuming you don't have an analysis running currently... if you do have an analysis running, just leave it be, and it will probably get cleaned up on its own later if the analysis finishes normally).

  • batch_size (integer): The size of one "batch" in Keras, namely, the number of samples (trials) per "gradient update," in Keras terms. In lay terms, let's say you have 1000 trials in your training dataset. With a batch size of 100, each epoch (see below) will happen in 10 batches; with a batch size of 200, it will only take 5 batches. Your choice of batch size could affect your model's performance in terms of accuracy, but more often the choice is about how fast training goes. Bigger batches mean fewer batches per epoch, which means faster training. However, all the data in a batch has to be able to fit into your GPU's memory at one time, so you can't just crank this value indefinitely high. Our default is 500 but you may want/need to tweak this based on your GPU's available RAM and/or other factors. (As with most of our Keras-specific parameters, there is plenty of information out there on batch size in Keras if you want to know more.)

  • epochs (integer): The maximum number of training epochs (i.e., complete passes through the training data). We default to 10000, but in practice it may not matter much, because our default is always to have early stopping on (i.e., stop training the neural network when it seems like learning has plateaued, as discussed above), and we almost never get to this number of epochs before early stopping kicks in. This parameter may become more relevant to users if we ever implement the option to disable early stopping. For now, you probably don't need to worry about it unless you are doing something very special and weird, and in that case you hopefully already understand whatever that weird thing is that you're doing.

  • verbose (integer): Keras verbosity level, aka how much text to spit out during training. Choices are 0, 1, or 2; default is 1 and that is a reasonable value to choose in most cases, if you want to watch/monitor your analysis in the terminal. You might choose 0 if you don't want any text output (presumably in this case you have already tuned your model architecture and just want to run it through a bunch of data). 2 is probably more output than most people desire. Note that we currently have a Bitbucket issue on the list to replace this with a more toolbox-general verbosity setting, so don't get too attached to it -- but we will try to use similar numbers and semantics to Keras's values for reasonable backwards compatibility when/if we implement our own verbosity setting.

  • reduction_patience (integer): When the model has not improved for a certain number of epochs, the learning rate is reduced to afford more subtle refinements in its parameters. This value (which is simply called patience in keras.callbacks.ReduceLROnPlateau, if you want to look up the details in the Keras docs) determines how many epochs of non-improvement it should take before reducing the learning rate. Our default is the same as Keras's default, which is 10. Generally, this should be a value less than the regular patience option. (For historical reasons, we only support this option in Keras 2.0 and up, but we don't really support or endorse the use of earlier versions anyway. If you really want to use this option with earlier versions of Keras, let us know, but be forewarned that the legacy code for Keras 1.x is marked for deprecation in future releases.)

  • reduction_factor (numeric value): How much to reduce the learning rate by each time learning rate reduction kicks in, as described in reduction_patience above. This is simply called factor in keras.callbacks.ReduceLROnPlateau if you want to look up more details in the Keras docs. Note that the Keras default is 0.1 but our default is a more conservative 0.5. A value somewhere between those two numbers would generally be reasonable.

Again, the above backend options are presently for KERAS ONLY -- they will be silently ignored in PyMVPA. Formatting them in a JSON job file might be a bit confusing at first (dictionary syntax is hard), so you might want to check out the sample_jobfiles directory and/or our video tutorials for some usage hints.
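
To give you a rough idea of the shape of things, a backend_options entry in a job file might look something like the following. The values shown are just the defaults described above, and the placement within the overall job file is glossed over here -- again, consult the sample job files for the real thing.

    "backend_options": {
        "monitor": "val_loss",
        "patience": 50,
        "batch_size": 500,
        "epochs": 10000,
        "verbose": 1,
        "reduction_patience": 10,
        "reduction_factor": 0.5
    }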

dataset (DTData object or None): The DTData object representing your dataset. If you are using a JSON job file to set up your analysis, you won't have to worry about creating this -- it will be created for you and associated with your DTAnalysis object by DTJob. But if you are creating a DTAnalysis by hand, you'll want to create a DTData object and assign it to this dataset attribute before you attempt to run the analysis.

model (DTModel object or None): This is the DTModel object representing the analysis model with which you intend to analyze your data. Similar to dataset above, you won't have to worry about creating this yourself if you're using job files -- but if you are doing things manually, you'll have to create a DTModel object and then assign it to this model attribute before you run your analysis.

xval_sa (string, probably, or None): The name of whatever "sample attribute," or SA for short (see DTData documentation for more info), you want to use for any cross-validation scheme that requires you to specify an SA. At the moment that is only the loop_over_sa scheme (see above), or the experimental transfer_over_sa scheme if you like to live your life in the DAAAANGER ZOOOONE! So if you aren't using those schemes, this will probably be None (which is to say, if you're creating your analysis from a JSON job file, you don't have to specify anything for this value at all).

If you do need to use this attribute, much like for classify_over above, technically it probably doesn't HAVE to be a string -- we don't do a lot of checking, so it could be anything that Python allows you to use as a key to a dictionary. But don't be a weirdo, just make it a string like a normal person. For example, if you wanted to loop over your subject SA to run a separate analysis for each subject in your dataset, just specify "subject" for this attribute.
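
In job-file terms, that loop-over-subjects setup amounts to a fragment along these lines (surrounding structure omitted):

    "xval_type": "loop_over_sa",
    "xval_sa": "subject"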

output_handler (DTOutput object or None): This is the DTOutput object that will be used to write any output you want written. Similar to model and dataset above, this will be created for you if you're using JSON-based job files, but if you're doing things by hand, you'll have to make a DTOutput object yourself and then assign it to the output_handler attribute if you want your DTAnalysis to actually produce any output files.

slicing (True or None): Currently, this option is not something most users need to worry about. If you are creating your analysis with JSON job files, you don't need to specify it at all. For now, the default is True and any value other than None means that the toolbox handles slicing of data into training, possibly validation, and test subsets (e.g., as specified by the train_val_test option, as described above). The only reason you might set this to None is if you have your own slicing scheme in mind that you are writing your own code for -- but at the moment doing so would also involve setting some other attributes of DTAnalysis that we aren't documenting, since that route is much trickier and more error-prone. If you really want to dance with this particular devil in the pale moonlight, get in touch with the devs and/or poke around in the code a bit. In the future, we plan to allow easier ways for users to implement their own slicing schemes via this option, so stay tuned for changes to come.

others: Other attributes get created on-the-fly even if they aren't provided at initialization... for now, we don't document those extensively as they aren't necessary for creating your analysis and they aren't really meant to be user-accessible in most cases, so if you are using JSON job files to create and run your analysis, you don't really have to worry about them. We may document some of the more relevant ones more extensively in the future, for people who are using this module by writing their own code. For now, we will just mention that ones you might be interested in could include train_acc, train_loss, val_acc, val_loss, test_acc, test_loss, and test_prediction_scores, which contain the relevant values on the performance of the classifier after each iteration of cross-validation is run.



Methods

Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.


__init__( self, nits=None, xval_type=None, nfolds=None, train_val_test=None, scaling_method=None, scaling=None, classify_over=None, backend_options=None, dataset=None, model=None, xval_sa=None, output_handler=None, slicing=True )

(no return value) Initializer function for creating a new DTAnalysis object. Pretty much just assigns all the object's attributes. All of the arguments are optional at this point in time, but most of them will need to get assigned one way or another before you can actually use the object. Note that sync_backend_options_with_model() (see below) does also get called during initialization.


run( self )

(no return value) Once you have your DTAnalysis object all set up and ready to go, this method runs the appropriate kind of analysis (according to the cross-validation scheme and other options you have set up in the object's various attributes). No input arguments to this method. Pretty much just hands things off to one of the other run_X() methods described below. Note that sync_backend_options_with_model() (see below) does get called here too.
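
For the brave few writing their own scripts, a bare-bones usage sketch might look something like the code below. Fair warning on what is assumed here: dt_data, dt_model, and dt_output are objects you have already constructed as described in the DTData, DTModel, and DTOutput documentation (their constructors aren't covered on this page), and the exact import path may differ depending on how you have the toolbox set up.

    # Bare-bones sketch of manual (non-job-file) usage; dt_data, dt_model, and
    # dt_output are assumed to already exist (see their respective class docs).
    from DTAnalysis import DTAnalysis   # exact import path may differ in your setup

    analysis = DTAnalysis(
        nits=10,
        xval_type="single",
        train_val_test=[60, 20, 20],
        scaling_method="percentile",
        scaling=80,
        classify_over="category",
        backend_options={"monitor": "val_loss", "patience": 50},
    )

    analysis.dataset = dt_data            # DTData object
    analysis.model = dt_model             # DTModel object
    analysis.output_handler = dt_output   # DTOutput object

    analysis.run()                        # runs nits iterations of the "single" scheme
    print(analysis.test_acc)              # per-iteration test performance (see "others" above)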


run_transfer( self, iteration_num )

(no return value) Runs a transfer-learning analysis (still experimental at the moment; you probably shouldn't use this analysis option at present). Usually you would not call this directly; it would get called by run(). If you do end up calling it directly, iteration_num should be an integer indicating what iteration we're on (looping through iterations is normally handled in run()), which is mainly used in generating output.


run_loop_over_sa( self, iteration_num )

(no return value) Runs an analysis looping over a "sample attribute" (for details, see xval_type under Attributes above). Usually you would not call this directly; it would get called by run(). If you do end up calling it directly, iteration_num should be an integer indicating what iteration we're on (looping through iterations is normally handled in run()), which is mainly used in generating output.


run_single( self, iteration_num )

(no return value) Runs a straightforward "single" type analysis (for details, see single under Attributes above). Usually you would not call this directly; it would get called by run(). If you do end up calling it directly, iteration_num should be an integer indicating what iteration we're on (looping through iterations is normally handled in run()), which is mainly used in generating output.


sync_backend_options_with_model( self, warn_if_none=True, warn_if_different=True )

(no return value) Propagates any backend-specific options in the backend_options attribute (see relevant entry in the Attributes section) to the model attribute. Basically this is because both DTAnalysis and DTModel end up needing access to the backend options at different points, and it seemed cleaner to give them each a copy than to have them reaching into each other or passing the options back and forth continuously. Gets called automatically during object initialization in __init__() and when an analysis is run with run(). Input arguments:

  • warn_if_none (True or False): Whether to display a warning message if the DTAnalysis object does not yet have anything in its model attribute to sync with.

  • warn_if_different (True or False): Whether to display a warning message if the DTModel object in the model attribute already has backend options defined and they are different than the ones held by this DTAnalysis object. If the two sets of backend options are different, the ones in this DTAnalysis object will be copied to the model attribute, which is perfectly fine if this is what you want to happen (for example, you want to change the backend options on a previously-defined DTAnalysis and run it again). But if your two copies of the backend options somehow get out of sync when you don't mean for them to, that might be something you'd like a warning about.
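
As a quick, hypothetical illustration of when you might call this yourself: suppose you tweak the backend options on an existing DTAnalysis object and want the change pushed to its model right away rather than waiting for the next run():

    # Hypothetical illustration; run() would sync this for you anyway.
    analysis.backend_options["patience"] = 100
    analysis.sync_backend_options_with_model(warn_if_different=False)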


rescale_data( self )

(no return value) Rescales the current training, validation, and/or test datasets in accordance with the desired scaling_method (see Attributes section). Would normally get called at appropriate points of the various run_X() methods, not directly. This functionality is arguably better suited to the DTData class, so at some point it may get tweaked a bit and moved there, but for now it's here in DTAnalysis.



CLASS METHOD

validate_arguments( cls, args )

(returns True or False) Validates the various input arguments used to initialize a DTAnalysis object; returns True if they are all OK and False if something is wrong with them (e.g. missing required attributes, wrong values or data types). Typically used to check the format of a JSON job file, and as such would be called by DTJob when the job file is read in (rather than a user calling this method directly).

Note that currently this method works in the laziest way possible, namely it just tries to create a temporary DTAnalysis object with the arguments given. If that object is created successfully, then it returns True; if some kind of error occurs, it returns False. In the future, hopefully we will make this method a bit smarter so it can actually inspect the arguments and give more useful feedback on exactly what is wrong with them.

Note also that if a Python global variable named dt_debug_mode is defined and set to True, a failed validation will cause an error rather than just making this method return False. Right now dt_debug_mode does default to True, but in the future we intend to switch to the more graceful failure behavior of simply giving an informative warning.
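
As a hypothetical example of direct use (normally DTJob does this for you), assuming job_args is a dictionary of DTAnalysis keyword arguments that you have parsed out of a job file yourself:

    # Hypothetical usage; job_args is assumed to be a dict of DTAnalysis keyword
    # arguments parsed from a JSON job file (normally DTJob handles this step).
    if not DTAnalysis.validate_arguments(job_args):
        raise ValueError("DTAnalysis arguments in job file failed validation")
    # Note: with dt_debug_mode set to True (currently the default), a failed
    # validation raises an error inside validate_arguments instead of returning False.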



JSON job file options

Generally what you specify in a JSON job file will be some subset of the attributes listed in the Attributes section above; however, not all attributes will need to be included in a typical job file. So here is a quick recap of the attributes that would typically be a good idea to include in a job file, and what data type they should be. For details on how they behave, see the Attributes section. As always, we recommend that you check out the sample_jobfiles directory and/or our video tutorials for some usage hints.

nits (integer): How many iterations of the analysis to run.

xval_type (string): One of single, loop_over_sa, or transfer_over_sa (this last one is experimental and should probably not be used unless you really love trouble), indicating what type of cross-validation scheme to use. Other options, e.g., kfold, are possible/likely in the future.

nfolds (integer): (Not actually used yet but will be when we implement k-fold cross-validation.)

train_val_test (3-item list of numeric values): Proportions of input data to use as training data, validation data (if relevant), and test data, respectively. Should add up to either 100 or 1, your choice.

scaling_method (string, if provided): What kind of rescaling operation to use. Current options include percentile, standardize, map_range, and mean_center. If you don't want any of those, just don't specify this option at all in your JSON file and your data won't get rescaled at all.

scaling (single numeric value or list of 2 numeric values, if provided): Numeric parameter for scaling_method above, exact meaning dependent on which scaling method is used (see Attributes section for the lowdown). If you don't need a value for this parameter or want to go with whatever the default is for that scaling method, just don't specify this option in your JSON file.

classify_over (string, probably): The name (or potentially other kind of key, if you're a weirdo) of whatever SA (sample attribute) defines the category you want to classify over.

backend_options (dictionary, if provided): Various options that can be passed along to the backend, which essentially means Keras right now (our PyMVPA support does not currently include any backend options). There are potentially lots of these, which you can specify as key/value pairs within the overall dictionary. See Attributes section above for gory details. You will probably also want to check out some of the sample JSON job files to get a full handle on how to write these out.

xval_sa (string, probably): If relevant to your cross-validation scheme (currently only loop_over_sa or the experimental transfer_over_sa), this is the name (or potentially other kind of key, if you're still a weirdo) of the SA (sample attribute) to use for that scheme.
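
Putting it all together, the DTAnalysis-related portion of a job file might look roughly like the fragment below. This is illustrative only -- the exact top-level layout of a job file, including the keys that wrap this section and the accompanying data/model/output specifications, is not guessed at here; the sample_jobfiles directory is the authoritative reference for that.

    "nits": 10,
    "xval_type": "loop_over_sa",
    "xval_sa": "subject",
    "train_val_test": [60, 20, 20],
    "scaling_method": "percentile",
    "scaling": 80,
    "classify_over": "category",
    "backend_options": {
        "monitor": "val_loss",
        "patience": 50,
        "batch_size": 500
    }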