PREFACE TO ALL DOCUMENTATION: We have tried to be as comprehensive, helpful, and accurate as we can in all of these documents, but providing good documentation is always an uphill climb. Our code, and the underlying backend code we rely on, is always changing, which means things can easily go out of date; and doing these kinds of analyses is intrinsically a complicated process, which makes it hard to write documentation that works well for people at all different levels of technical proficiency and familiarity with the underlying concepts and technologies. We really don't want the learning curve to be a barrier to people using this toolbox, so we highly recommend -- especially while the number of users is relatively small and manageable -- getting in touch with the developers if you're confused, don't know where to start, etc., etc. And, of course, if you find any errors, omissions, or inconsistencies! Seriously, don't be a stranger... we are happy to add features, flesh out documentation, walk through setup, and so on, to help this project serve as many users as possible, but that requires hearing from you to help let us know what areas need the most attention. Please see our website (
http://delineate.it/) and click on the contact page to get a link to our Bitbucket repo and an email address for the project devs.
This is the class that is primarily responsible for storing your dataset in memory and doing stuff to it. Your master
DTAnalysis object will have one
DTData object representing the data for that analysis.
Like the other main classes in this toolbox, a
DTData object is typically created by
DTJob from a specification contained in a JSON-format job file, although you can also create one manually in code if you know what you're doing.
Attributes
samples (NumPy array): The actual data that you want to feed into your analysis, as a NumPy array. If you're doing a PyMVPA analysis, this should be a 2-D array with dimensions
trials x features. If you're doing a deep learning analysis via Keras, it can be any dimensionality -- but the first dimension should still be trials, and the rest of the dimensions should match the configuration of the input layer of your neural network.
Note that if you're creating your analysis from a JSON job file, you normally don't have to specify this attribute directly; rather, you specify a "loader" function (see below) and DTData calls that function to get the dataset when it initializes itself.
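As a rough illustration of the array shapes described above, here is a minimal NumPy sketch. The specific sizes (40 trials, 500 features, a 64 x 64 single-channel input) are made up for the example and are not anything the toolbox requires.

```python
import numpy as np

n_trials = 40  # hypothetical number of trials

# PyMVPA-style analysis: a 2-D array, trials x features
samples_2d = np.zeros((n_trials, 500))

# Keras-style analysis: any dimensionality, as long as the first
# dimension is trials and the remaining dimensions match the input
# layer of the network (here, a made-up 64 x 64 single-channel input)
samples_nd = np.zeros((n_trials, 64, 64, 1))

print(samples_2d.shape, samples_nd.shape)
```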
loader (passed in as a string when initializing, but gets converted into a function object during initialization and stored as such internally): What data-loading function (aka "loader") to use. When the
DTData object is fully instantiated, this will contain a function object representing the function to call. When you pass this into the
DTData initialization function or specify it in a JSON job file, it can just be a string with the name of the function. In conjunction with
loader_file (see below),
DTData automagically finds the loader function to use. If
loader_file is not specified,
DTData will search through all the files in the "loaders" directory (in the main level of the toolbox folder) and if any of them contain a function whose name is the string in
loader, that's the function it will use. (If more than one file in the "loaders" directory has a function with that name in it,
DTData will just use whichever one it finds first.)
In short, if you're setting up your analysis with a JSON job file, this is just the name of the loader function to use. If you are an advanced user writing your own code using the
DTData class, you can still do the same thing -- but if you don't want to use a loader function at all (and instead want to just load and configure your dataset in your own code, and stick it into the
samples attribute manually), you can pass in the string
"null_loader" for this attribute during
DTData initialization. (
null_loader is a loader function we provide in the
sample_loaders.py file that does nothing except keep
DTData from throwing an error during initialization, as it normally does throw an error if you try to get away without passing in a loader function. In the future we may implement a less clunky way to skip passing in a loader function, but in the meantime,
null_loader should work fine for any adventurous souls using this class programmatically.)
If you are writing your own loader function, you will probably want to check out our
sample_loaders.py file for some examples on how they work. In brief, every loader function should take exactly one input argument (see
loader_params below for more details), although it is welcome to ignore that argument if it likes. After doing whatever it needs to do in order to load in your data, it should return exactly three outputs, which will become the
samples, sa, and fa attributes of the
DTData object, so see the docs above and below on those attributes for more details. Note that we don't currently do much with the
fa attribute, so you are welcome to return
None for that output if you want to, as we currently do in all of our sample loaders. But
samples and sa are critical to any classification, so you probably want to return meaningful values for those unless you are doing something very weird and special.
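To make the loader contract above more concrete, here is a minimal sketch of a loader function: one input argument, three return values. The function name, the fabricated data, and the order of the return values (samples, then sa, then fa) are assumptions made for illustration; check sample_loaders.py for how the real loaders are written.

```python
import numpy as np

def my_toy_loader(loader_params):
    """Hypothetical loader: takes one argument, returns three outputs."""
    # A real loader would typically read the file(s) named in
    # loader_params; this toy version ignores them and makes up data.
    n_trials, n_features = 6, 4
    samples = np.random.rand(n_trials, n_features)
    sa = {
        "condition": ["face", "scene", "face", "scene", "face", "scene"],
        "subject": [1, 1, 1, 2, 2, 2],
    }
    fa = None  # feature attributes are not used for much at present
    # Assumed return order: samples, sa, fa
    return samples, sa, fa

samples, sa, fa = my_toy_loader(["ignored.txt"])
print(samples.shape, sorted(sa.keys()), fa)
```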
loader_file (string): If your data-loading function is not in the "loaders" directory, or if you just don't want
DTData snorfling through all your files looking for the one with the right loader in it, you can use this to specify the exact file the loader is in. It just has to be something Python can find, so that can mean an absolute path (starting at the root of the entire filesystem), a relative path that is resolvable relative to whatever current working directory you're running from, or a bare filename that is somewhere on Python's search path. This attribute is optional during initialization (and thus is optional in JSON job files) if the function name specified in the
loader attribute can be found in any of the files in the "loaders" directory (see the notes on the
loader attribute above for more details on how that works).
files (old and busted; see
loader_params): Don't use this attribute; it has been replaced by
loader_params (see below) but has not been taken out of the code yet, so as to avoid breaking old scripts and job files.
loader_params (new hotness; can be almost anything, but a string or list of strings is common): The
loader_params attribute is the new name for what used to be called
files. Regardless of what you call it, the contents of this attribute are what gets passed into your loader function when it runs. Often this would be a string naming a file of data to load (or a list of several such files), hence the old name -- but we realized there were lots of other kinds of arguments that could be useful in telling loader functions what their business should be, and so we generalized the name of this attribute to
loader_params. You can still use
files for now, but you might get a gentle warning about it (telling you deprecation of that name is somewhere in your future), so you may want to start changing over now.
You may want to check out the
sample_loaders.py file (and/or our tutorial videos) to see how
loader_params is used in practice. If you are using one of our loaders, those examples should show you what to pass in for the loader parameters (currently all of our loaders take either a single string or a list of strings, but what those strings are used for can vary across different loader functions). If you're writing your own loader function, you can do pretty much whatever you want as long as the parameters you give that loader match its expectations -- just note that currently, we pass in the value of
loader_params to the loader function without doing any checking to see if the loader function takes any arguments. So, every loader function should be written to take a single argument, even if it ignores that argument. But if your loader function doesn't need any arguments/parameters, it is perfectly OK to specify
None for loader_params (or equivalently, not to include it in your JSON job file), as
None is the default value for loader_params.
Note that despite our saying that
loader_params can be almost anything, your loader function should treat it like a list (unless it ignores the parameters entirely), since
DTData initialization automatically list-ifies any non-list passed in for the
loader_params attribute. This is basically so loader functions can be written consistently to expect a list of parameters to be passed in, but users are allowed to get a little lazy in their JSON job files and not worry about list-ifying the loader parameters there if there is only one (e.g. a single filename string, which is a very common use case).
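The list-ifying behavior described above can be pictured with a small sketch. The helper name listify is hypothetical and this is not the toolbox's actual code; it only shows why a loader can always expect a list.

```python
def listify(loader_params):
    """Hypothetical helper: wrap any non-list value in a list."""
    if isinstance(loader_params, list):
        return loader_params
    # How the toolbox itself treats a value of None is not covered here.
    return [loader_params]

# A single filename string becomes a one-element list...
single = listify("subject01.mat")
# ...while a list is passed through unchanged.
several = listify(["subject01.mat", "subject02.mat"])
print(single, several)
```

A loader written against this convention can loop over its argument without first checking whether it was handed a string or a list.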
sa (dictionary): A dictionary of "sample attributes," using the same terminology as PyMVPA and CoSMoMVPA, so you could also check out the documentation for those packages if the concepts are unclear as they are expressed here. Basically, a sample attribute is any label that can be applied to a sample (aka trial, for most neuroscience datasets) of data. Any dataset should have at least one SA, namely the class over which you intend to perform your classification. But datasets can have as many other SAs as you want, which could get used or could go unused. Another common SA type is subject (participant) number/code, which would likely be used in conjunction with the
loop_over_sa cross-validation scheme to do a separate classification analysis for each subject. (See the
DTAnalysis documentation for more details on that.) You could certainly use the
loop_over_sa scheme to loop over other SAs as well besides subject -- that's just the most obvious example.
At present there aren't many other uses for SAs besides those cases, but there is also always the option to produce output files of SAs, even if they aren't used for anything in the analysis. This is useful if you want to do any post-processing of the output. For example, maybe you want to keep track of which exact trials are randomly selected for the test dataset in each round of classification. To do that, you would include in your dataset an SA (let's call it
trial_id) that gives each trial a unique ID number, and then enable
tags:trial_id in your
DTOutput settings. This will make it fairly easy to, for example, track which specific trials (over potentially many iterations of classification) are consistently classified well and which ones are consistently classified poorly. See the
DTOutput documentation for more details.
In terms of actually implementing the
sa attribute, most users will not need to provide it directly during
DTData initialization; much like
samples above, it is most typically loaded and returned by your loader function. If you're writing your own loader function, or using the
sa attribute in your own code, note that it should be a dictionary, with keys that are typically strings (although we don't explicitly check for this, so if you can figure out some other type of key you want to use and make it work, knock yourself out). As noted in the preceding paragraphs, typical keys would be things like
subject or condition (or whatever you want to call the main attribute you're classifying over). The values for each key should be a Python list or 1-D NumPy array whose length is equal to the number of samples/trials in the
samples attribute. What the individual labels/SAs within that list/array are is up to you; you can use numeric labels like 1, 2, 3, or string labels like "face" and "scene"... or presumably you COULD use more esoteric data types if you really want to make life harder on yourself, but most people are going to use either numbers or strings. If you are writing your own loader or other code that uses the
sa attribute directly, you might want to check out our
sample_loaders.py file to see how we do it in there.
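Here is a minimal sketch of what an sa dictionary might look like for a four-trial dataset, along with the length check implied above. The key names and label values are made up for the example.

```python
import numpy as np

samples = np.zeros((4, 10))  # 4 trials x 10 features (made-up sizes)

sa = {
    "condition": ["face", "scene", "face", "scene"],  # string labels
    "subject": np.array([1, 1, 2, 2]),                # numeric labels
    "trial_id": list(range(4)),                       # unique ID per trial
}

# Every sample attribute needs one label per trial
for name, values in sa.items():
    if len(values) != samples.shape[0]:
        raise ValueError("wrong number of labels for " + name)

print({name: len(values) for name, values in sa.items()})
```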
fa (dictionary): A dictionary of "feature attributes," which are similar to "sample attributes" as described directly above, but for the features of your dataset rather than the samples/trials. For example, in fMRI data, the most obvious example of an FA would be a voxel ID (or you could have three separate FAs for its x-, y-, and z-coordinates). In EEG data, reasonable FAs would be things like electrode labels or time codes or both. This is also terminology borrowed from PyMVPA and CoSMoMVPA, so you could also check out the documentation for those packages if the concepts are unclear as they are expressed here.
At the moment we don't actually do anything with FAs, so you don't have to, either -- but if you write your own loader, there isn't anything stopping you from loading them and returning them. Presumably some people might find them useful if they are using the toolbox programmatically and writing their own code around it, which is why we make it possible to store FAs conveniently in
DTData alongside everything else. But if you are using JSON job files and/or you don't really care about any feature attributes, you can forget you ever heard of them. (At least for now -- we may implement more explicit functionality with FAs in the future.)
auto_load (True or False): Most people won't need to worry about this attribute and can just ignore it, in which case it defaults to
True. Basically, if
False is provided instead, then
loader_params are ignored during initialization of the
DTData object. So, you would probably never specify
False if you are using the toolbox with JSON job files. If you are writing your own code using the toolbox modules programmatically, you might specify
False if you want to create a mostly empty
DTData object and fill in the actual data later, or if you want to directly feed samples, sa, and fa into
DTData initialization and ignore the whole concept of loader functions. Note that if you don't want to entirely ignore the concept of loader functions but also don't want them getting called when
DTData is initialized, you can specify
False for auto_load, and then manually call
load_data() (see below under Methods) whenever you're ready to set the loader function in action, since all
auto_load really does is call
load_data() at the end of a
DTData object's initialization.
others: Other attributes get created on-the-fly even if they aren't provided at initialization... for now, we don't document those extensively as they aren't necessary for creating your analysis and they aren't really meant to be user-accessible in most cases, so if you are using JSON job files to create and run your analysis, you don't really have to worry about them. We may document some of the more relevant ones more extensively in the future, for people who are using this module by writing their own code. For now, we will just mention that the only one you might be interested in is
mask, which basically contains a representation of which samples/trials are/aren't getting used at the moment... for example, if using the
loop_over_sa cross-validation scheme to loop over subjects and do a separate classification for each one, each subject is masked in (and all others masked out) in turn. But that is a fairly deep implementation detail, so you probably don't even care about that one (if you do, though, there's more info in the
mask_sa methods below).
Methods
Note that most users won't need to invoke these directly if they are creating their analyses via JSON job files, but some brief descriptions are provided for the brave few considering writing their own scripts. As always, if you are considering writing your own scripts, you might want to contact the devs for inside details.
(no return value) Initializer function for creating a new
DTData object. Pretty much just assigns all the object's attributes, plus a little basic checking of the loader function and loader parameters. Also, if
auto_load is set to
True, the load_data() method (see below) will get called at the end of initialization and presumably cause the samples, sa, and
fa attributes to get loaded in. All of the arguments are optional at this point in time, but most of them will need to get assigned one way or another before you can actually use the object.
(no return value) One-liner function that just calls the loader function (stored in
self.loader), passing it any necessary parameters (stored in
self.loader_params), and saves the return values of the loader function (of which there should be three) into self.samples, self.sa, and self.fa.
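The relationship between initialization, auto_load, and load_data() can be sketched with a toy stand-in class. ToyData and toy_loader are hypothetical names, and this is not the real DTData code (which takes more arguments and does more checking); the order of the three return values is also an assumption.

```python
class ToyData:
    """Hypothetical stand-in for illustration; not the real DTData."""

    def __init__(self, loader=None, loader_params=None, auto_load=True):
        self.loader = loader
        self.loader_params = loader_params
        self.samples = self.sa = self.fa = None
        # All auto_load does is trigger load_data() at the end of init
        if auto_load:
            self.load_data()

    def load_data(self):
        # Call the loader and store its three return values
        # (assumed order: samples, sa, fa)
        self.samples, self.sa, self.fa = self.loader(self.loader_params)


def toy_loader(params):
    # Made-up data: two trials, two features
    return [[0.1, 0.2], [0.3, 0.4]], {"condition": ["a", "b"]}, None


eager = ToyData(loader=toy_loader)                  # loads right away
lazy = ToyData(loader=toy_loader, auto_load=False)  # stays empty for now
lazy_before = lazy.samples
lazy.load_data()                                    # load when ready
print(eager.samples, lazy_before, lazy.samples)
```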
(returns a list of logical indices) Most users, even if writing their own code, don't need to worry about this method; unlike lipstick or Preparation H, it is intended mainly for internal use. Basically, it just logically "ands" together any existing mask with the index array that is passed in. Why would someone want to do that? Well, unless you're actually developing the toolbox, you probably shouldn't worry your pretty little face over it.
(returns a whole bunch of stuff) Another method that would rarely need to be called directly by a user, even if writing their own code; usually it gets called by one of the
DTAnalysis.run_X() functions. As such, we'll be brief; if you really want to call it manually, the code is fairly self-explanatory, or feel free to get in touch with the devs.
Basically, this chops up the dataset into training, validation, and test subsets according to the specified proportions, ensuring equal numbers of trials/samples from each class in each of those subsets, so that training will be balanced across classes. Returns all of those things as well as their category labels (converted to a numeric scheme starting at 0; see
map_labels() below) and, optionally, numeric indices of the trials/samples in the test dataset, which is a potential output option (see
DTOutput documentation for more on this).
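For readers who want a feel for what a class-balanced split involves, here is a simplified sketch. The function name balanced_split, its signature, and the choice to size every class by the smallest class are all assumptions made for illustration; the toolbox's own method returns more than just indices and may balance differently.

```python
import numpy as np

def balanced_split(labels, train_prop, val_prop, test_prop, rng):
    """Hypothetical sketch: split trial indices with equal counts per class."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Size everything by the smallest class so all classes contribute equally
    n_per_class = min(int(np.sum(labels == c)) for c in classes)
    n_train = int(round(n_per_class * train_prop))
    n_val = int(round(n_per_class * val_prop))
    n_test = int(round(n_per_class * test_prop))
    train, val, test = [], [], []
    for c in classes:
        # Shuffle the indices of this class, then carve off each subset
        idx = rng.permutation(np.flatnonzero(labels == c))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:n_train + n_val + n_test])
    return np.array(train), np.array(val), np.array(test)

labels = np.array([0] * 10 + [1] * 10)
rng = np.random.default_rng(0)
train_idx, val_idx, test_idx = balanced_split(labels, 0.6, 0.2, 0.2, rng)
print(len(train_idx), len(val_idx), len(test_idx))
```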
(no return value) Another method typically called by
DTAnalysis that users will rarely need to call directly. Sets the
mask attribute (see above in Attributes) to mask in the value(s) specified in
keep_values for the sample attribute specified by
sa_name. Used, for example, when
DTAnalysis runs a separate classification for each subject (or any other sample attribute, but subject is the most obvious usage case); each time through the loop, the subject currently being analyzed is masked in using this method, and everyone else is masked out.
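The masking idea can be sketched with NumPy boolean arrays. The helper make_mask is a hypothetical name and this is not the toolbox's implementation; it only shows masking in one subject and then logically "and"-ing that mask with another index array.

```python
import numpy as np

sa = {"subject": np.array([1, 1, 2, 2, 3, 3])}

def make_mask(sa, sa_name, keep_values):
    """Hypothetical helper: True wherever the SA matches a kept value."""
    return np.isin(sa[sa_name], keep_values)

# Mask in subject 2, mask out everyone else
mask = make_mask(sa, "subject", [2])

# Logically "and" the mask with some other boolean index array
other = np.array([True, False, True, False, True, False])
combined = np.logical_and(mask, other)

print(mask.tolist())      # [False, False, True, True, False, False]
print(combined.tolist())  # [False, False, True, False, False, False]
```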
(returns a function object for a loader function) One of our dirty little secrets; don't look too close! Actually not that bad, but this is the method that, when passed in the name of a loader function (such as that specified by the user in a JSON job file or during
DTData initialization), snarfles through either the
loader_file (see Attributes above), or through all the files in the "loaders" directory in the main level of the toolbox (if no loader file is specified) to find the right function. The way we do it is not not gross, but it gets the job done.
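One way such a lookup could work is sketched below using the standard importlib machinery. This is an assumption-laden illustration, not the toolbox's actual approach: find_loader, the temporary directory, and toy_loaders.py are all made up for the demo. Note that this approach executes every file it inspects.

```python
import importlib.util
import os
import tempfile

def find_loader(func_name, search_dir):
    """Hypothetical: return the first function named func_name in search_dir."""
    for filename in sorted(os.listdir(search_dir)):
        if not filename.endswith(".py"):
            continue
        path = os.path.join(search_dir, filename)
        # Import the file as a module straight from its path
        spec = importlib.util.spec_from_file_location(filename[:-3], path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        candidate = getattr(module, func_name, None)
        if callable(candidate):
            return candidate  # first match wins
    return None

# Demo with a throwaway directory standing in for "loaders"
with tempfile.TemporaryDirectory() as loaders_dir:
    with open(os.path.join(loaders_dir, "toy_loaders.py"), "w") as f:
        f.write("def toy_loader(params):\n    return params, {}, None\n")
    found = find_loader("toy_loader", loaders_dir)
    missing = find_loader("no_such_loader", loaders_dir)

print(found(["x"]), missing)
```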
(returns True or False) Validates the various input arguments used to initialize a
DTData object; returns
True if they are all OK and
False if something is wrong with them (e.g. missing required attributes, wrong values or data types). Typically used to check the format of a JSON job file, and as such would be called by
DTJob when the job file is read in (rather than a user calling this method directly).
Note that currently this method works in the laziest way possible, namely it just tries to create a temporary
DTData object with the arguments given. If that object is created successfully, then it returns
True; if some kind of error occurs, it returns
False. In the future, hopefully we will make this method a bit smarter so it can actually inspect the arguments and give more useful feedback on exactly what is wrong with them.
Note also that if a Python global variable named
dt_debug_mode is defined and set to
True, a failed validation will cause an error rather than just making this method return
False. Right now
dt_debug_mode does default to
True, but we intend to change this some day to the more graceful validation failure behavior of simply giving an informative warning.
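The "try to build it and see what happens" style of validation, including the debug-mode behavior, can be sketched as follows. The function validate_by_construction, its debug_mode parameter, and the Picky class are hypothetical stand-ins; the real method reads a global dt_debug_mode variable rather than taking an argument.

```python
def validate_by_construction(cls, debug_mode=False, **kwargs):
    """Hypothetical: valid if and only if cls(**kwargs) can be created."""
    try:
        cls(**kwargs)  # temporary object, thrown away immediately
    except Exception:
        if debug_mode:
            raise  # surface the original error instead of hiding it
        return False
    return True


class Picky:
    """Hypothetical stand-in class that rejects a missing loader."""

    def __init__(self, loader=None):
        if loader is None:
            raise ValueError("a loader is required")


ok = validate_by_construction(Picky, loader="some_loader")
not_ok = validate_by_construction(Picky)
print(ok, not_ok)
```

The drawback mentioned above is visible here: a return value of False says nothing about which argument was at fault.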
(returns a PyMVPA Dataset object) PyMVPA packages up data differently than Keras does (and differently from the way we store it in
DTData) so this little utility function takes a chunk of data and a list of labels and returns it in the format that PyMVPA is expecting. Users probably won't need to call this much; it is mainly used by
DTModel.train_pymvpa() to pack up the training data in the correct way.
(returns a one-hot-encoded NumPy matrix) This is a function copied straight from Keras (with appropriate license), so that we don't have to have Keras as a dependency, for folks who might want to use this toolbox with PyMVPA only (or other backends, when/if we add them). Plus a little error checking of our own. Takes in a vector of integers representing class codes, and returns a matrix representing the same classes as a set of one-hot-encoded binary values. Used mainly by
train_val_test_to_categorical (see below).
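For anyone unfamiliar with one-hot encoding, here is a minimal NumPy sketch of the conversion. It is neither the Keras code nor the toolbox's copy of it; the function name one_hot and its error check are assumptions made for illustration.

```python
import numpy as np

def one_hot(class_codes, num_classes=None):
    """Hypothetical sketch: integer class codes -> one-hot matrix."""
    codes = np.asarray(class_codes, dtype=int)
    if codes.ndim != 1 or (codes < 0).any():
        raise ValueError("expected a 1-D vector of non-negative integers")
    if num_classes is None:
        num_classes = int(codes.max()) + 1
    out = np.zeros((codes.size, num_classes))
    # Row i gets a 1 in the column given by that trial's class code
    out[np.arange(codes.size), codes] = 1
    return out

encoded = one_hot([0, 2, 1])
print(encoded)
```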
(returns a one-hot-encoded version of the training, validation, and test datasets passed in) Pretty straightforward, just categorical-izes (changes to one-hot encoding) the training, validation, and/or test datasets that it is given as arguments (which are presumed to be encoded with integer class codes). If a validation set is passed in, double-checks to make sure it has the same number of categories/classes as the training set. (The same is not enforced for the test set, because there are occasions where you might not have all the categories in your test set and that is perfectly OK.) Most of the time the toolbox handles when to convert class labels from integer to one-hot, but this could theoretically be a useful little function if you're rolling your own analyses.
(returns a set of "zero-based" labels and a dictionary for converting back the other way) Yet another little utility function for converting class labels. This one takes in a list/array of class labels that can be pretty much anything (though presumably would be either numbers or strings) and converts them to an integer representation, starting with zero (which is what PyMVPA expects; Keras wants a one-hot-encoded version, so for Keras this would be an intermediate step to be followed with the categorical conversions described above). Basically enables the convenience of users being able to label their data however they like (e.g., as 'face', 'scene', 'object' or something like that) and have this toolbox worry about making those make sense to our classification backends. Also returns a dictionary that maps these integer-ized class labels back to whatever they were originally, which is used by
DTOutput (in conjunction with
unmap_labels() below) to recreate the original labeling scheme so that most users never have to worry about how their labels were transmogrified during the analysis process. That said, if you are writing your own code, this could be a useful little utility for you as well for integer-izing your string (or whatever) labels.
(returns a set of NON-"zero-based" labels in the user's original labeling scheme) Basically does the exact inverse of the
map_labels() function described above. Takes in a set of class labels in an integer coding scheme (that starts at zero) in conjunction with the dictionary returned by
map_labels(), and returns a list of class labels in whatever coding scheme (strings, etc.) the user originally coded their classes with.
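The round trip between user labels and zero-based integers can be sketched like this. The function names carry a _sketch suffix to make clear they are not the toolbox's own map_labels() and unmap_labels(); in particular, assigning integers in sorted label order is an assumption.

```python
def map_labels_sketch(labels):
    """Hypothetical: arbitrary labels -> zero-based ints plus a reverse map."""
    unique = sorted(set(labels))  # assumed ordering: sorted labels
    forward = {label: i for i, label in enumerate(unique)}
    mapped = [forward[label] for label in labels]
    back = {i: label for label, i in forward.items()}
    return mapped, back

def unmap_labels_sketch(mapped, back):
    """Hypothetical inverse: zero-based ints -> original labels."""
    return [back[i] for i in mapped]

original = ["face", "scene", "object", "face"]
mapped, back = map_labels_sketch(original)
restored = unmap_labels_sketch(mapped, back)
print(mapped, back, restored)
```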
Generally what you specify in a JSON job file will be some subset of the attributes listed in the Attributes section above; however, not all attributes will need to be included in a typical job file. So here is a quick recap of the attributes that would typically be a good idea to include in a job file, and what data type they should be. For details on how they behave, see the Attributes section. As always, we recommend that you check out the
sample_jobfiles directory and/or our video tutorials for some usage hints.
loader (string): The name of your loader function.
loader_file (string): The name of the file that your loader function is in. This can be an absolute path, a relative path (relative to your current working directory when you started the
delineate.py script), a bare filename (as long as the directory it's in is somewhere in Python's search path), or nothing at all (i.e., you can just skip specifying this option, if your loader function is in any of the files located within the "loaders" directory in the main level of the toolbox folder).
files (string or list of strings): Don't use this anymore, as it has been replaced by
loader_params (see below), but basically this was the name of the file (or a list of filenames) to pass into your loader function.
loader_params (potentially anything, but probably a string or list of strings): Replacement for
files above since that name implied too narrow a usage case. What you put in this parameter depends on what loader function you're using, since it basically just gets passed along to the loader. So if you didn't write the loader yourself, you might need to check out its code or documentation to know what to put in here. Most commonly this would probably be either a single filename or a list of filenames for your loader function to process, but if the loader allows something else (like numeric parameters or whatever), that works too.
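To show how the three recommended options fit together, here is a sketch that builds them as a Python dictionary and prints the equivalent JSON. The loader name and filenames are made up, and how this fragment is nested inside a complete job file is not shown here; see the sample_jobfiles directory for real examples.

```python
import json

# Hypothetical DTData-related options; all values are made up
data_options = {
    "loader": "my_loader_function",
    "loader_file": "my_loaders.py",
    "loader_params": ["subject01.mat", "subject02.mat"],
}

# What those options would look like written out as JSON
json_text = json.dumps(data_options, indent=4)
print(json_text)
```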