DNAnexus

DNAnexus paths on stor are prefixed with dx:// and have two components: dx://<PROJECT>:<FILE_OR_FOLDER>. The project and the file can each be given as a virtual path (i.e., a human-readable name) or a canonical path (an opaque, globally unique ID that the platform assigns); see below for more details.

Canonical Paths on DNAnexus

Files on DNAnexus have a globally unique immutable handle (called a dxid) and also a virtual path in each project. DNAnexus only allows one copy of a file to be in a specific project. Also, since the canonicalized path to the file on DNAnexus is represented by:

'project-j47b1k3z8Jqqv001213v312j1:file-47jK67093475061g3v95369p'

having multiple locations for a single file within a project is infeasible. However, one canonical file can be present in multiple projects at different paths.

Thus, stor has two subclass implementations of DXPath: DXCanonicalPath and DXVirtualPath. As the name suggests, DXCanonicalPath deals with paths like:

Path('dx://project-j47b1k3z8Jqqv001213v312j1:file-47jK67093475061g3v95369p')
OR
Path('dx://project-j47b1k3z8Jqqv001213v312j1:/file-47jK67093475061g3v95369p')
OR
Path('dx://project-j47b1k3z8Jqqv001213v312j1')

DXVirtualPath handles paths that have any human readable element in them:

Path('dx://project-j47b1k3z8Jqqv001213v312j1:/path/to/file.txt')
OR
Path('dx://myproject:/path/to/file.txt')
OR
Path('dx://myproject:path/to/file.txt')

You can obtain the DXCanonicalPath from a DXVirtualPath and vice versa, like so:

>>> stor.Path('dx://StorTesting:/1.bam').canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://StorTesting:1.bam').canonical_path.canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://StorTesting:/1.bam').virtual_path.canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70').virtual_path
DXVirtualPath("dx://StorTesting:/1.bam")

The canonical_path and virtual_path attributes are cached, so repeated access to these properties does not issue a new API request to the DX server.
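The distinction between the two path forms can be sketched without touching the network. The helpers below (split_dx_path and is_canonical) are hypothetical names for illustration and are not stor's implementation; the ID pattern is an assumption that only requires "<class>-<alphanumerics>", which is looser than the platform's real ID format.

```python
import re

# Assumption: a dxid looks like "<class>-<alphanumerics>"; the platform's
# real ID format is stricter than this illustrative pattern.
PROJECT_ID = re.compile(r'^project-[0-9A-Za-z]+$')
FILE_ID = re.compile(r'^file-[0-9A-Za-z]+$')


def split_dx_path(path):
    """Split 'dx://<PROJECT>:<RESOURCE>' into (project, resource)."""
    prefix = 'dx://'
    if not path.startswith(prefix):
        raise ValueError('not a dx:// path: %r' % path)
    project, _, resource = path[len(prefix):].partition(':')
    if not project or '/' in project:
        # e.g. 'dx://path/to/file' names no project
        raise ValueError('a project is required: %r' % path)
    # A leading slash on the resource is optional, so drop it.
    return project, resource.lstrip('/')


def is_canonical(path):
    """True only if every component of the path is an opaque ID."""
    project, resource = split_dx_path(path)
    if not PROJECT_ID.match(project):
        return False
    return resource == '' or FILE_ID.match(resource) is not None
```

Under these assumptions, any human-readable component (a project name or a folder path) makes the path virtual, which mirrors the split between DXCanonicalPath and DXVirtualPath described above.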

Directories on DNAnexus

DNAnexus has the concept of directories on the platform like posix (unlike Swift/S3). These directories can be empty, and also have names with extensions. Two folders with the same parent path cannot have the same name, i.e., no duplicate folders are allowed (like posix).

Note that folders are not actual resources on the DNAnexus platform. They exist only as metadata on the server and therefore have no canonical ID of their own. Consequently, paths to folders can only be DXVirtualPath, and accessing the DXVirtualPath.canonical_path property for a folder raises an error.

The summary of different behaviors for different filesystems is presented here. The individual details are explained further below.

Filesystem Differences

File/Directory names
  • Posix: Filenames without extension and dirnames with ext are allowed. A dir and a file can have the same name in same path
  • Swift/S3: Anonymous file / dir names are not allowed on swift paths
  • DNAnexus: Filenames without extension and dirnames with ext are allowed. A dir and a file can have the same name in same path

Duplicates
  • Posix: Not allowed
  • Swift/S3: Not allowed
  • DNAnexus: Duplicate filenames are allowed. Duplicate dir names are not allowed

list
  • Posix: Recursively lists the files and empty directories
  • Swift/S3: Recursively lists the files and empty directory markers.
  • DNAnexus: Recursively lists the files within DX folder path. No empty directories are listed.

list with prefix
  • Posix: Treated as prefix to absolute path
  • Swift/S3: Treated as prefix to absolute path
  • DNAnexus: Treated as path to a subfolder to list

list on filepath
  • Posix: Returns [filepath]
  • Swift/S3: Returns [filepath]
  • DNAnexus: Returns []

listdir on non-existent folder
  • Posix: Returns []
  • Swift/S3: Returns []
  • DNAnexus: Raises NotFoundError

copy, target exists and is file
  • Posix: Overwrites the file
  • Swift/S3: Overwrites the file
  • DNAnexus: Deletes the existing file before copying over

copy, target exists and is dir
  • Posix: Not allowed
  • Swift/S3: Not possible
  • DNAnexus: Copied within existing dir

serverside copy/copytree
  • Posix: Allowed
  • Swift/S3: Not allowed
  • DNAnexus: Allowed

copytree, target exists as file
  • Posix: Not allowed
  • Swift/S3: Allowed
  • DNAnexus: Allowed

copytree, target exists as dir
  • Posix: Merges the two directories
  • Swift/S3: Overwrites the dir
  • DNAnexus: Copies inside existing directory. If root folder is moved, project name is used while copying if needed.
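The "list with prefix" difference is easy to miss, so here is a minimal sketch of the two interpretations over an in-memory list of paths. Both functions are hypothetical illustrations, not stor code: the first treats the argument as a plain string prefix (the Posix and Swift/S3 behavior above), the second treats it as a subfolder (the DNAnexus behavior above).

```python
def list_with_prefix(paths, base, prefix):
    """Prefix semantics: `prefix` extends the absolute path as a plain string."""
    full = base.rstrip('/') + '/' + prefix
    return [p for p in paths if p.startswith(full)]


def list_in_subfolder(paths, base, subfolder):
    """Subfolder semantics: only contents of the named folder match."""
    folder = base.rstrip('/') + '/' + subfolder.strip('/') + '/'
    return [p for p in paths if p.startswith(folder)]


paths = ['/data/run1/a.txt', '/data/run10/b.txt', '/data/run1.log']

# As a string prefix, 'run1' also matches 'run10/...' and 'run1.log'.
by_prefix = list_with_prefix(paths, '/data', 'run1')

# As a subfolder, 'run1' matches only what lives under '/data/run1/'.
by_folder = list_in_subfolder(paths, '/data', 'run1')
```

With this sample data, the prefix form returns all three paths while the subfolder form returns only '/data/run1/a.txt'.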

Files on DNAnexus

Files stored on the DX platform are immutable. This is because the files are internally stored in AWS while the metadata handling is taken care of by the platform. Hence, once a file is uploaded, it cannot be modified.

When one file is copied to another project, only additional metadata is produced, while the underlying file on AWS remains the same. This is essential. The same file with the same canonical ID will appear in both projects, and can have different folder paths. Deleting a file from one project is possible, which deletes the metadata and leaves the file untouched in other projects.
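The relationship between the stored file and per-project metadata can be pictured with a toy in-memory model. TinyPlatform is a hypothetical class written for this explanation only; it does not reflect how DNAnexus or stor is implemented, and it ignores details such as permissions.

```python
class TinyPlatform:
    """Toy model: file contents are stored once, projects hold only metadata."""

    def __init__(self):
        self.blobs = {}      # file ID -> immutable bytes
        self.projects = {}   # project name -> {file ID: folder path}

    def upload(self, project, file_id, folder, data):
        self.blobs[file_id] = bytes(data)
        self.projects.setdefault(project, {})[file_id] = folder

    def clone(self, file_id, src, dest, folder):
        if file_id not in self.projects.get(src, {}):
            raise KeyError(file_id)
        # Only a metadata entry is added; the bytes are not duplicated.
        self.projects.setdefault(dest, {})[file_id] = folder

    def remove(self, project, file_id):
        # Drops this project's metadata entry only.
        del self.projects[project][file_id]


platform = TinyPlatform()
platform.upload('projA', 'file-1', '/raw', b'ACGT')
platform.clone('file-1', 'projA', 'projB', '/shared')
platform.remove('projA', 'file-1')
```

Because each project maps a file ID to a single folder, the model also shows why one file cannot appear at two locations within the same project.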

DXPath on stor

Project is always required for DNAnexus instances:

>>> Path('dx://path/to/file')
Traceback (most recent call last):
...
<exception>

but projects are normalized:

>>> Path('dx://myproject')
DXVirtualPath('dx://myproject:')

Duplicate names on DNAnexus

A single virtual path can refer to multiple files (and even a folder) simultaneously! Currently, stor will error if a specific virtual path resolves to multiple files (use the dx-tool in these cases), but you can always use a canonical path:

$ dx upload myfile.txt -o MyProject:/myfile.txt
$ dx upload anotherfile.txt -o MyProject:/myfile.txt
$ stor cat dx://MyProject:/myfile.txt
# MultipleObjectsSameNameError: Multiple objects found at path (dx://MyProject:/myfile.txt). Try using a canonical ID instead

When a folder has the same name as a file, stor uses the method you call to check for a folder or a file (i.e., DXPath.listdir will assume folder, DXPath.stat will assume file).
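The "exactly one match or error" rule can be sketched as follows. MultipleMatchesError and resolve_unique are hypothetical stand-ins written for illustration; stor's own exception is MultipleObjectsSameNameError, and its lookup goes through the DNAnexus API rather than a list passed in by the caller.

```python
class MultipleMatchesError(Exception):
    """Hypothetical stand-in for stor's MultipleObjectsSameNameError."""


def resolve_unique(virtual_path, matches):
    """Return the single file ID found at a virtual path, or raise.

    `matches` is the list of file IDs that share the given virtual path.
    """
    if not matches:
        raise LookupError('nothing found at %s' % virtual_path)
    if len(matches) > 1:
        raise MultipleMatchesError(
            'Multiple objects found at path (%s). '
            'Try using a canonical ID instead' % virtual_path)
    return matches[0]
```

When the lookup is ambiguous, addressing the file by its canonical ID sidesteps name resolution entirely.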

DXPath

exception stor.dx.DNAnexusError(message, caught_exception=None)[source]

Base class for all remote errors thrown by this DX module

class stor.dx.DXCanonicalPath(pth)[source]

Represents fully canonicalized DNAnexus paths: ‘dx://project-{dxID}:/file-{dxID}’ or ‘dx://project-{dxID}:’

property canonical_path

Get DXCanonicalPath instance for path

property canonical_project

The canonical dxid for the project

property canonical_resource

The canonical dxID of the file resource

exists()[source]

Checks existence of the path.

Returns

True if the path exists, False otherwise.

Return type

bool

normpath()[source]

Normalize path following linux conventions (keeps drive prefix)

splitpath()[source]

Wrapper around base splitpath function which calls splitpath on the normpath of self

virtual_path

The DXVirtualPath instance equivalent to the canonical path within the specified project

property virtual_project

The virtual (human-readable) name of the project associated with this path

property virtual_resource

The virtual (human-readable) path of the resource associated with this path

class stor.dx.DXPath(pth)[source]

Provides the ability to manipulate and access resources on DNAnexus servers with stor interfaces.

abspath()

No-op for ‘abspath’

clear_cached_properties()[source]

Clears all cached properties in DXPath objects.

The canonical and virtual forms of DXPath objects are cached to not hit the server for every transformation call. However, after copy/remove/rename, the cached information is outdated and needs to be cleared.
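The cache-then-clear pattern described here can be sketched in a few lines. CachedResolution and its resolver argument are hypothetical; the resolver stands in for whatever API call performs the virtual-to-canonical lookup.

```python
class CachedResolution:
    """Caches the result of an expensive lookup until explicitly cleared."""

    def __init__(self, virtual, resolver):
        self._virtual = virtual
        self._resolver = resolver   # hypothetical callable doing the API lookup
        self._cache = {}

    @property
    def canonical_path(self):
        # Only the first access invokes the resolver.
        if 'canonical_path' not in self._cache:
            self._cache['canonical_path'] = self._resolver(self._virtual)
        return self._cache['canonical_path']

    def clear_cached_properties(self):
        # Call after copy/remove/rename, when the cached answer may be stale.
        self._cache.clear()
```

The trade-off is the usual one for caches: fewer server round trips in exchange for the caller having to invalidate after operations that change what the path points to.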

property content_type

Get content type for DXObject. Returns empty string if not present or is project/

copy(dest, raise_if_same_project=False, **kwargs)[source]

Copies data object to destination path.

If dest already exists as a directory on the DX platform, the file is copied underneath dest directory with original name.

If the target destination already exists as a file, it is first deleted before the copy is attempted.

For example, assume the following file hierarchy:

dxProject/
- a/
- - 1.txt

anotherDxProject/

Doing a copy of 1.txt to a new destination of b.txt is performed with:

Path('dx://dxProject:/a/1.txt').copy('dx://anotherDxProject:/b.txt')

The end result for anotherDxProject looks like:

anotherDxProject/
- b.txt

And, if the destination already exists as a directory, i.e. we have:

dxProject/
- a/
- - 1.txt

anotherDxProject/
- b.txt/

Performing copy with following command:

Path('dx://dxProject:/a/1.txt').copy('dx://anotherDxProject:/b.txt')

Will yield the resulting structure to be:

anotherDxProject/
- b.txt/
- - 1.txt

If the source file and destination belong to the same project and raise_if_same_project is False, the file is moved instead of copied, because the same underlying file cannot appear in two locations in the same project.

If the final destination for the file already is an existing file, that file is deleted before the file is copied.

Parameters
  • dest (Path|str) – The destination file or directory.

  • raise_if_same_project (bool, default False) – Controls moving file within project instead of cloning. If True, raises an error to prevent this move. Only takes effect when both source and destination are within the same DX Project

Raises
  • DNAnexusError – When copying within same project with raise_if_same_project=True

  • NotFoundError – When the source file path doesn’t exist
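The two decisions that copy makes (where the file ends up, and whether it is cloned or moved) can be sketched as pure functions. The names resolve_copy_dest and plan_copy are hypothetical, and RuntimeError stands in for DNAnexusError; this is a reading of the rules above, not stor's code.

```python
import posixpath


def resolve_copy_dest(src_resource, dest_resource, dest_is_folder):
    """Final resource path for a copied file."""
    if dest_is_folder:
        # An existing folder receives the file under its original name.
        return posixpath.join(dest_resource, posixpath.basename(src_resource))
    return dest_resource


def plan_copy(src_project, dest_project, raise_if_same_project=False):
    """Return 'clone' for cross-project copies and 'move' within a project."""
    if src_project != dest_project:
        return 'clone'
    if raise_if_same_project:
        # stor raises DNAnexusError here; RuntimeError is a stand-in.
        raise RuntimeError('source and destination are in the same project')
    return 'move'
```

For the hierarchy in the example above, resolve_copy_dest('/a/1.txt', '/b.txt', False) gives '/b.txt', while the same call with dest_is_folder=True gives '/b.txt/1.txt'.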

copytree(dest, raise_if_same_project=False, **kwargs)[source]

Copies a source directory to a destination directory. This is not an atomic operation.

If the destination path already exists as a directory, the source tree including the root folder is copied over as a subfolder of the destination.

If the source and destination directories belong to the same project, the tree is moved instead of copied. Also, in such cases, the root folder of the project cannot be the source path. Please listdir the root folder and copy/copytree individual items if needed.

For example, assume the following file hierarchy:

project1/
- b/
- - 1.txt

project2/

Doing a copytree from project1:/b/ to a new dx destination of project2:/c is performed with:

Path('dx://project1:/b').copytree('dx://project2:/c')

The end result for project2 looks like:

project2/
- c/
- - 1.txt

If the destination path directory already exists, the folder is copied as a subfolder of the destination. If this new destination also exists, a TargetExistsError is raised.

If the source is a root folder, and is cloned to an existing destination directory or if the destination is also a root folder, the tree is moved under project name.

Refer to dx docs for detailed information.

Parameters
  • dest (Path|str) – The directory to copy to. Must not exist if it's a posix directory

  • raise_if_same_project (bool, default False) – Allows moving files within project instead of cloning. If True, raises an error to prevent moving the directory. Only takes effect when both source and destination directory are within the same DX Project

Raises
  • DNAnexusError – Attempt to clone within same project and raise_if_same_project=True

  • TargetExistsError – All possible destinations for source directory already exist

  • NotFoundError – source directory path doesn’t exist
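The destination rules for copytree, including the TargetExistsError case, can be sketched like this. TargetExists and resolve_copytree_dest are hypothetical names; the set of existing folders is passed in explicitly, and the special handling of a project root as the source is not modeled.

```python
import posixpath


class TargetExists(Exception):
    """Hypothetical stand-in for stor's TargetExistsError."""


def resolve_copytree_dest(src_folder, dest_folder, existing_folders):
    """Pick where the source tree lands, given folders that already exist."""
    if dest_folder not in existing_folders:
        # Destination is free: the tree is copied to it directly.
        return dest_folder
    # Destination exists: the tree becomes a subfolder of it.
    nested = posixpath.join(
        dest_folder, posixpath.basename(src_folder.rstrip('/')))
    if nested in existing_folders:
        raise TargetExists(nested)
    return nested
```

With project2 empty, copying '/b' to '/c' lands at '/c'; if '/c' already exists it lands at '/c/b'; if both exist, the sketch raises.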

dirname()[source]

Returns directory name of path. Returns self if path is a project.

To avoid making API calls, canonical paths will return the project ID

download(dest, **kwargs)[source]

Download a directory.

Parameters

dest (Path) – The output directory

Raises

NotFoundError – When source or dest path is not a directory

download_object(dest, **kwargs)[source]

Download a single path or object to file.

Parameters

dest (Path) – The output file

Raises

NotFoundError – When source path is not an existing file

download_objects(dest, objects)[source]

Downloads a list of objects to a destination folder.

Note that this method takes a list of complete relative or absolute OBSPaths to objects (in contrast to taking a prefix). If any object does not exist, the call will fail with partially downloaded objects residing in the destination path.

Parameters
  • dest (str) – The destination folder to download to. The directory will be created if it doesn't exist.

  • objects (List[str|PosixPath|SwiftPath]) – The list of objects to download. The objects can be paths relative to the download path or absolute obs paths. Any absolute obs path must be children of the download path

Returns

A mapping of all requested objs to their location on disk

Return type

dict

Examples

To download objects to a dest/folder destination:

from stor import Path
p = Path('dx://project:/dir/')
results = p.download_objects('dest/folder', ['subdir/f1.txt',
                                             'subdir/f2.txt'])
print(results)
{
    'subdir/f1.txt': 'dest/folder/subdir/f1.txt',
    'subdir/f2.txt': 'dest/folder/subdir/f2.txt'
}

To download full obs paths relative to a download path:

from stor import Path
p = Path('dx://project:/dir/')
results = p.download_objects('dest/folder', [
    'dx://project:/dir/subdir/f1.txt',
    'dx://project:/dir/subdir/f2.txt'
])
print(results)
{
    'dx://project:/dir/subdir/f1.txt': 'dest/folder/subdir/f1.txt',
    'dx://project:/dir/subdir/f2.txt': 'dest/folder/subdir/f2.txt'
}
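The mapping returned in both examples can be reproduced offline with a small helper. plan_downloads is a hypothetical function that only computes the mapping; it performs no download and is not part of stor.

```python
import posixpath


def plan_downloads(base, dest, objects):
    """Map each requested object to its local destination path."""
    base = base.rstrip('/') + '/'
    plan = {}
    for obj in objects:
        if obj.startswith('dx://'):
            # Absolute paths must be children of the download path.
            if not obj.startswith(base):
                raise ValueError('%s is not under %s' % (obj, base))
            relative = obj[len(base):]
        else:
            relative = obj
        # Keys are the objects exactly as requested.
        plan[obj] = posixpath.join(dest, relative)
    return plan
```

Keying the result by the object as the caller wrote it (relative or absolute) matches the two outputs shown above.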

exists()[source]

Checks whether path exists on local filesystem or on swift.

For directories on swift, checks whether directory sentinel exists or at least one subdirectory exists

expanduser()

No-op for ‘expanduser’

getsize()[source]

Returns size, in bytes of path.

glob(pattern, condition=None, canonicalize=False)[source]

Glob for pattern relative to this directory.

isdir()[source]

Determine if path is directory-like (i.e., it’s a project, or it’s a folder that can be listed)

Returns

True if path is an existing folder path or project

Return type

bool

isfile()[source]

Determine if an object exists at the specified path

Returns

True if path points to an existing file

Return type

bool

joinpath(*others)[source]

Wrapper around base joinpath function which converts the first part to normpath before joining with others.

list(canonicalize=False, starts_with=None, limit=None, classname=None, condition=None)[source]

List contents using the resource of the path as a prefix. This will only list the file resources (and not empty directories like other OBS).

Warning

Prefer list_iter() to this method in production code. If there are many files (i.e., more than 1-2K) to list, this method may take a long time to return and use a lot of memory to construct all of the objects.

Examples

>>> Path('dx://MyProject:/my/path/').list(canonicalize=False)
[Path('dx://MyProject:/my/path/to/file.txt'), ...]
>>> Path('dx://MyProject:/my/path/').list(canonicalize=True)
[Path('dx://project-123:file-123'), ...]

Parameters
  • canonicalize (bool, default False) – if True, return canonical paths

  • starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory

  • limit (int) – Limit the amount of results returned

  • classname (str) – Restricting class: one of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’

  • condition (function(results) -> bool) – The method will only return when the results matches the condition.

Returns

Iterates over listed files that match an optional pattern.

Return type

List[DXPath]

list_iter(canonicalize=False, starts_with=None, limit=None, classname=None)[source]

Iterable that yields objects under prefix (especially useful when a folder may have many small files)

Note that this is a wrapper function to walkfiles.

Parameters
  • canonicalize (bool, default False) – if True, return canonical paths

  • starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory

  • limit (int) – Limit the amount of results returned

  • classname (str) – Restricting class: one of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’

Returns

Iterates over listed files that match an optional pattern.

Return type

Iterable[DXPath]
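Why an iterator helps with large listings can be shown with a generic sketch. It assumes, purely for illustration, that results arrive in pages via a hypothetical fetch_page(token) callable returning (items, next_token); nothing here describes how stor or dxpy actually paginates.

```python
import itertools


def iter_listing(fetch_page, limit=None):
    """Yield results page by page, stopping once `limit` items are produced."""
    def generate():
        token = None
        while True:
            # Assumption: fetch_page returns (items, next_token or None).
            items, token = fetch_page(token)
            for item in items:
                yield item
            if token is None:
                return
    # islice with a stop of None yields everything.
    return itertools.islice(generate(), limit)
```

Because pages are requested only as the consumer asks for more items, a small limit avoids fetching and materializing the whole listing, which is the concern raised in the warning on list().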

listdir(only='all', canonicalize=False)[source]

List the path as a dir, returning top-level directories and files.

Parameters
  • canonicalize (bool, default False) – if True, return canonical paths

  • only (str) – “objects” for only objects, “folders” for only folders, “all” for both

Returns

Iterates over listed files directly within the resource

Return type

List[DXPath]

Raises

NotFoundError – When resource folder is not present on DX platform

listdir_iter(canonicalize=False)[source]

Iterate the path as a dir, returning top-level directories and files.

Parameters

canonicalize (bool, default False) – if True, return canonical paths

Returns

Iterates over listed files directly within the resource

Return type

Iterable[DXPath]

makedirs_p(mode=511)[source]

Make directories, including parents on DX from DX folder paths.

Parameters

mode – unused, present for compatibility (access permissions are managed at project level)

property name

File or folder name of the path. Empty string for projects or folders with trailing slash.

Makes no API calls to server, canonical paths are treated normally, and the basename of the path is returned.

normpath()[source]

Normalize path following linux conventions (keeps drive prefix)

open(mode='r', encoding=None)[source]

Opens an OBSFile that can be read or written to and is uploaded to the remote service.

For examples of reading and writing opened objects, view OBSFile.

Parameters
  • mode (str) – The mode of object IO. Currently supports reading (“r” or “rb”) and writing (“w”, “wb”)

  • encoding (str) – text encoding to use. Defaults to locale.getpreferredencoding(False)

Returns

The file object for Swift/S3/DX.

Return type

OBSFile

Raises

property project

The project name from the path or None

read_object()[source]

Reads an individual object from DX. Note dxpy for Py3 automatically decodes the DXFile.read using utf-8.

Returns

the raw bytes from the object on DX.

Return type

bytes

realpath()

No-op for ‘realpath’

remove()[source]

Removes a single object from DX platform

Raises

ValueError – The path is invalid.

property resource

The virtual or canonical path to the file within the project (as a PosixPath).

Examples

>>> Path('dx://project:dir/file').resource
PosixPath('dir/file')
>>> Path('dx://project-123:file-456').resource
PosixPath('file-456')

NOTE: to avoid making API requests, this operation only uses the local string

rmtree()[source]

Removes a resource and all of its contents. The path should point to a project or directory.

Raises

NotFoundError – The path points to a nonexistent directory

stat()[source]

Performs a stat on the path. This method follows (slightly vague) behavior of dxpy’s describe method. It works as expected for a virtual path. However, for a canonical path:

Path('dx://project-123:/file-123')

where project-123 and file-123 both exist but file-123 is not inside project-123, stat will still return the describe response for file-123 (with its default project).

Use stor.exists to check if a canonical path actually exists.

Raises

temp_url(lifetime=300, filename=None)[source]

Obtains a temporary URL to a DNAnexus data-object.

If DX_FILE_PROXY_URL or [dx] file_proxy_url= is set, will use that to construct a path instead, e.g.:

>>> stor.Path('dx://proj:/folder/mypath.csv').temp_url()
'https://dl.dnanex.us/F/D/awe1323/mypath.csv'
>>> with stor.settings.use({'dx': {'file_proxy_url':
...     'https://my-dnax-proxy.example.com/gateway'}}):
...     stor.Path('dx://proj:/folder/mypath.csv').temp_url()
'https://my-dnax-proxy.example.com/gateway/proj/folder/mypath.csv'

The file proxy is assumed to be a service that, when given DX path and project, will proxy through to DNAnexus to render content.

Parameters
  • lifetime (int) – The time (in seconds) the temporary URL will be valid (only for temp URL generation)

  • filename (str, optional) – A urlencoded filename to use for attachment, otherwise defaults to object name (to use no filename at all, use filename='')

Raises
  • ValueError – The path points to a project

  • ValueError – file_proxy_url is set and filename does not match object name

  • ValueError – file_proxy_url does not look like a valid http(s) path

  • NotFoundError – The path could not be resolved to a file (when file_proxy_url unset)
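The proxy form of the URL in the example above can be approximated with a short helper. proxy_url is a hypothetical function inferred from that single example (proxy base, then project, then resource); the percent-encoding step is an added assumption, and stor's actual construction may differ.

```python
from urllib.parse import quote, urlparse


def proxy_url(file_proxy_url, project, resource):
    """Join a proxy base URL with a project and a resource path."""
    parsed = urlparse(file_proxy_url)
    if parsed.scheme not in ('http', 'https') or not parsed.netloc:
        raise ValueError('file_proxy_url must be an http(s) URL')
    # Assumption: path segments are percent-encoded, keeping '/' as-is.
    return '/'.join([
        file_proxy_url.rstrip('/'),
        quote(project),
        quote(resource.lstrip('/')),
    ])
```

Calling proxy_url('https://my-dnax-proxy.example.com/gateway', 'proj', '/folder/mypath.csv') reproduces the proxied URL from the example.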

to_url()[source]

For compatibility with OBS - returns temp_url()

upload(to_upload, **kwargs)[source]

Upload a list of files and directories to a directory.

This is not a batch-level operation. If an upload fails, files uploaded before the failure remain present.

Parameters

to_upload (List[Union[str, OBSUploadObject]]) – A list of posix file names, directory names, or OBSUploadObject objects to upload.

Raises

walkfiles(pattern=None, canonicalize=False, recurse=True, starts_with=None, limit=None, classname=None)[source]

Iterates over listed files that match an optional pattern.

Parameters
  • pattern (str) – glob pattern to match the filenames against.

  • canonicalize (bool, default False) – if True, return canonical paths

  • recurse (bool, default True) – if True, look in subfolders of folder as well

  • starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory

  • limit (int) – Limit the amount of results returned

  • classname (str) – Restricting class: one of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’

Returns

Iterates over listed files that match an optional pattern.

Return type

Iter[DXPath]
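How pattern and recurse interact can be sketched over an in-memory list of paths. walk_matching is a hypothetical helper that uses the standard-library fnmatch module to match filenames; it illustrates the parameters above and is not stor's implementation.

```python
import fnmatch
import posixpath


def walk_matching(paths, folder, pattern=None, recurse=True):
    """Yield paths under `folder` whose filename matches a glob pattern."""
    folder = folder.rstrip('/') + '/'
    for path in paths:
        if not path.startswith(folder):
            continue
        relative = path[len(folder):]
        if not recurse and '/' in relative:
            # Skip anything that lives in a subfolder.
            continue
        name = posixpath.basename(path)
        if pattern is None or fnmatch.fnmatchcase(name, pattern):
            yield path
```

The pattern is applied to the filename only, so '*.txt' matches files at any depth when recurse is True.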

write_object(content, **kwargs)[source]

Writes an individual object to DX.

Note that this method writes the provided content to a temporary file before uploading. This allows us to reuse code from DXPath’s uploader (multi part object support, etc.).

Parameters
  • content (bytes) – raw bytes to write to OBS

  • **kwargs – Keyword arguments to pass to DXPath.upload

class stor.dx.DXVirtualPath(pth)[source]

Class Handler for DXPath of form ‘dx://MyProject:/a/b/c’ or ‘dx://project-{uuid}:/b/c’

property canonical_path

The unique file or project that matches the given path

canonical_project

The dxid of the unique project for the given project name. Only resolves project user has access to.

Raises

canonical_resource

The dxid of the file at this path

Raises

exists()[source]

Checks existence of the path.

Returns

True if the path exists, False otherwise.

Return type

bool

normpath()[source]

Normalize path following linux conventions (keeps drive prefix)

splitpath()[source]

Wrapper around base splitpath function which calls splitpath on the normpath of self

property virtual_path

Path as DXVirtualPath

virtual_project

Returns the virtual name of the project associated with the DXVirtualPath

property virtual_resource

Human-readable path to the object in its DNAnexus Project (as PosixPath)

exception stor.dx.InconsistentUploadDownloadError(message, caught_exception=None)[source]

Thrown during checksum mismatch or part length mismatch.

exception stor.dx.MultipleObjectsSameNameError(message, caught_exception=None)[source]

Thrown when multiple objects exist with the same name

Currently, we throw this when trying to get the canonical project from virtual path and two or more projects were found with same name

exception stor.dx.ProjectNotFoundError(message, caught_exception=None)[source]

Thrown when no project exists with the given name

Currently, we throw this when trying to get the canonical project from virtual path and no project was found with same name

Copy and copytree by example

copy and copytree behave differently when the target output path exists and is a folder. (This holds for DX -> DX and also for POSIX -> DX.)

Copy/copytree example

stor cp myfile.vcf dx://project2:/call
  • output path (no folder): dx://project2:/call
  • output path (folder exists): dx://project2:/call/myfile.vcf

stor cp -r ./trip-photos dx://newproject:/all
  • output path (no folder): dx://newproject:/all
  • output path (folder exists): dx://newproject:/all/trip-photos

Note that if the output path exists and is a file, the file will be overwritten.

List, Listdir and walkfiles

stor ls, stor list and stor walkfiles for DXPaths take in a --canonicalize flag which returns the results with canonical dxIDs instead of human readable virtual paths. This is especially useful for manipulating paths directly using dx-toolkit through piping. This flag is ignored when passed for other paths (swift/s3/posix).

Open on stor

The open functionality for dx paths returns an instance of stor.obs.OBSFile, as with other OBS paths (Swift/S3). Although dxpy, the DNAnexus Python package, provides its own open functionality on DXFile, this is not carried over to stor. One of the main reasons is to limit the scope of DXFile.open to what is expected of stor. For example, dxpy’s version of DXFile.open does not have readline and readlines methods for reading the file. Conversely, dxpy supports an ‘append’ mode in DXFile.open, which can be confusing to a stor user: it applies only in very restricted scenarios, and using it requires knowing the internal states of a file on the DNAnexus platform, what they mean, when they occur, and which operations each state allows. By instantiating stor.obs.OBSFile for open, stor keeps its standard file interface without any real decrease in functionality.

The dx setting ‘wait_on_close’ has a default value of 0 (seconds). It determines how long a stor.open action waits for the file to reach the ‘closed’ state on DNAnexus. A file that is not in the ‘closed’ state on the platform cannot be read or downloaded. If you need to read a file right after writing it, set wait_on_close to a value > 0. The default is 0 so that, in the event of multiple uploads, each upload doesn’t wait for wait_on_close seconds before the next upload starts. However, setting wait_on_close > 0 can cause unexpected performance issues, depending on the performance of the DNAnexus platform, so change it only when you understand the trade-off.
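The effect of wait_on_close can be pictured as a bounded polling loop. wait_until_closed, its get_state callable, and the poll_interval parameter are hypothetical names used for illustration, as are the non-'closed' state strings in the usage note; this is a sketch of the idea, not the code stor runs.

```python
import time


def wait_until_closed(get_state, wait_on_close=0, poll_interval=0.01):
    """Poll until the state is 'closed' or `wait_on_close` seconds elapse.

    Returns True if the file reached 'closed', False on timeout.
    """
    deadline = time.monotonic() + wait_on_close
    while True:
        if get_state() == 'closed':
            return True
        if time.monotonic() >= deadline:
            # With wait_on_close=0 this is reached after a single check.
            return False
        time.sleep(poll_interval)
```

For instance, a get_state that reports 'closing' forever returns False immediately when wait_on_close is 0, which corresponds to the default of not blocking between uploads.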