DNAnexus¶
DNAnexus paths on stor are prefixed with dx:// and have two components:
dx://<PROJECT>:<FILE_OR_FOLDER>
where the project and the file can each be a virtual path (i.e., a human-readable name) or a canonical path
(an opaque, globally unique ID that the platform assigns). See below for more details.
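The two-component layout can be illustrated with a small parser. This is a minimal sketch written for this page, not stor's implementation: `split_dx_path` is a hypothetical helper, and it assumes the first colon after the `dx://` prefix separates the project from the resource.

```python
def split_dx_path(path):
    """Split 'dx://<PROJECT>:<FILE_OR_FOLDER>' into (project, resource).

    Hypothetical helper for illustration only; not part of stor.
    """
    prefix = "dx://"
    if not path.startswith(prefix):
        raise ValueError("not a dx:// path: %r" % path)
    # Split on the first colon only; the resource part may be absent.
    project, _, resource = path[len(prefix):].partition(":")
    # Assumption: a project never contains a slash, so 'dx://path/to/file'
    # is rejected because it has no project component.
    if not project or "/" in project:
        raise ValueError("a project is required: %r" % path)
    return project, resource


print(split_dx_path("dx://myproject:/path/to/file.txt"))
print(split_dx_path("dx://myproject"))
```

The sketch only works on the local string; it makes no attempt to decide whether either component is a human-readable name or a platform ID.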
Canonical Paths on DNAnexus¶
Files on DNAnexus have a globally unique, immutable handle (called a dxid) and also a virtual path in each project. DNAnexus allows only one copy of a file in a given project. Because the canonicalized path to a file on DNAnexus has the form:
'project-j47b1k3z8Jqqv001213v312j1:file-47jK67093475061g3v95369p'
a single file cannot have multiple locations within a project. However, one canonical file can be present in multiple projects at different paths.
Thus, stor has two subclass implementations of DXPath: DXCanonicalPath and DXVirtualPath. As the name suggests, DXCanonicalPath deals with paths like:
Path('dx://project-j47b1k3z8Jqqv001213v312j1:file-47jK67093475061g3v95369p')
OR
Path('dx://project-j47b1k3z8Jqqv001213v312j1:/file-47jK67093475061g3v95369p')
OR
Path('dx://project-j47b1k3z8Jqqv001213v312j1')
DXVirtualPath handles paths that have any human-readable element in them:
Path('dx://project-j47b1k3z8Jqqv001213v312j1:/path/to/file.txt')
OR
Path('dx://myproject:/path/to/file.txt')
OR
Path('dx://myproject:path/to/file.txt')
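The distinction between the two path kinds can be sketched as a string check. The snippet below is illustrative only: `classify` is a hypothetical function, and the ID patterns are an assumption (a `project-` or `file-` prefix followed by alphanumerics), not a statement of how stor or DNAnexus validates IDs.

```python
import re

# Assumption: IDs look like "project-" / "file-" followed by alphanumerics.
PROJECT_ID = re.compile(r"^project-[A-Za-z0-9]+$")
FILE_ID = re.compile(r"^file-[A-Za-z0-9]+$")


def classify(path):
    """Return 'canonical' or 'virtual' for a dx:// path string (hypothetical)."""
    project, _, resource = path[len("dx://"):].partition(":")
    resource = resource.lstrip("/")  # a leading slash is optional
    project_is_id = bool(PROJECT_ID.match(project))
    resource_is_id = resource == "" or bool(FILE_ID.match(resource))
    # Any human-readable element makes the whole path virtual.
    return "canonical" if project_is_id and resource_is_id else "virtual"


for example in [
    "dx://project-j47b1k3z8Jqqv001213v312j1:file-47jK67093475061g3v95369p",
    "dx://project-j47b1k3z8Jqqv001213v312j1",
    "dx://project-j47b1k3z8Jqqv001213v312j1:/path/to/file.txt",
    "dx://myproject:path/to/file.txt",
]:
    print(classify(example), example)
```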
You can obtain the DXCanonicalPath from a DXVirtualPath and vice versa, like so:
>>> stor.Path('dx://StorTesting:/1.bam').canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://StorTesting:1.bam').canonical_path.canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://StorTesting:/1.bam').virtual_path.canonical_path
DXCanonicalPath("dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70")
>>> stor.Path('dx://project-FJq288j0GPJbFJZX43fV5YP1:/file-FKzjPgQ0FZ3VpBkpKJz4Vb70').virtual_path
DXVirtualPath("dx://StorTesting:/1.bam")
The canonical_path and virtual_path attributes are cached, so repeated access to these properties doesn’t issue a new API request to the DX server.
Directories on DNAnexus¶
DNAnexus has the concept of directories on the platform, like posix (and unlike Swift/S3). These directories can be empty, and can also have names with extensions. Two folders with the same parent path cannot have the same name, i.e., no duplicate folders are allowed (like posix).
Note here that folders are not actual resources on the DNAnexus platform.
They are only handled through metadata on the server and, as a result, do not
have a canonical ID of their own. Paths to folders can therefore only be
DXVirtualPath. Accessing the DXVirtualPath.canonical_path property for a folder
raises an error.
The summary of different behaviors for different filesystems is presented here. The individual details are explained further below.
Topic | Posix | Swift/S3 | DNAnexus |
---|---|---|---|
File/Directory names | Filenames without extension and dirnames with ext are allowed. A dir and a file can have the same name in same path | Anonymous file / dir names are not allowed on swift paths | Filenames without extension and dirnames with ext are allowed. A dir and a file can have the same name in same path |
Duplicates | Not allowed | Not allowed | Duplicate filenames are allowed. Duplicate dir names are not allowed |
list | Recursively lists the files and empty directories | Recursively lists the files and empty directory markers. | Recursively lists the files within DX folder path. No empty directories are listed. |
list with prefix | Treated as prefix to absolute path | Treated as prefix to absolute path | Treated as path to a subfolder to list |
list on filepath | Returns [filepath] | Returns [filepath] | Returns [] |
listdir on non-existent folder | Returns [] | Returns [] | Raises NotFoundError |
copy, target exists and is file | Overwrites the file | Overwrites the file | Deletes the existing file before copying over |
copy, target exists and is dir | Not allowed | Not possible | Copied within existing dir |
serverside copy/copytree | Allowed | Not allowed | Allowed |
copytree, target exists as file | Not allowed | Allowed | Allowed |
copytree, target exists as dir | Merges the two directories | Overwrites the dir | Copies inside existing directory. If root folder is moved, project name is used while copying if needed. |
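The "list with prefix" row is easy to misread, so here is a small model of the two interpretations. It is a sketch over plain strings, assuming nothing about stor internals: `list_by_prefix` and `list_by_subfolder` are hypothetical names, and the sample keys are made up.

```python
def list_by_prefix(keys, base, starts_with):
    """Posix/Swift/S3-style: starts_with is a string prefix on the absolute path."""
    prefix = base.rstrip("/") + "/" + starts_with
    return [key for key in keys if key.startswith(prefix)]


def list_by_subfolder(keys, base, starts_with):
    """DNAnexus-style: starts_with names a subfolder whose contents are listed."""
    folder = base.rstrip("/") + "/" + starts_with.strip("/") + "/"
    return [key for key in keys if key.startswith(folder)]


# Made-up object keys for illustration.
keys = ["/data/run1/a.txt", "/data/run10/b.txt", "/data/run1.log"]

print(list_by_prefix(keys, "/data", "run1"))     # matches all three keys
print(list_by_subfolder(keys, "/data", "run1"))  # matches only /data/run1/a.txt
```

The same `starts_with` value therefore selects a wider set of results under the prefix interpretation than under the subfolder interpretation.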
Files on DNAnexus¶
Files stored on the DX platform are immutable. This is because the files are internally stored in AWS while the metadata handling is taken care of by the platform. Hence, once a file is uploaded, it cannot be modified.
When one file is copied to another project, only additional metadata is produced, while the underlying file on AWS remains the same. As a result, the same file with the same canonical ID appears in both projects and can have a different folder path in each. Deleting a file from one project is possible; it deletes only that project's metadata and leaves the file untouched in other projects.
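The metadata-only nature of a cross-project copy can be modeled with two dictionaries. This is a toy model under stated assumptions, not how DNAnexus or stor stores anything: `FakePlatform` and its methods are hypothetical, and the IDs and paths are invented.

```python
class FakePlatform:
    """Toy model: file content is stored once, projects only hold metadata."""

    def __init__(self):
        self.blobs = {}     # file ID -> immutable content
        self.projects = {}  # project name -> {folder path: file ID}

    def upload(self, project, path, file_id, content):
        self.blobs[file_id] = content
        self.projects.setdefault(project, {})[path] = file_id

    def clone(self, src_project, src_path, dest_project, dest_path):
        # Only a new metadata entry is created; the content is not duplicated.
        file_id = self.projects[src_project][src_path]
        self.projects.setdefault(dest_project, {})[dest_path] = file_id
        return file_id

    def remove(self, project, path):
        # Removes this project's metadata entry only.
        del self.projects[project][path]


platform = FakePlatform()
platform.upload("projA", "/raw/1.bam", "file-123", b"data")
platform.clone("projA", "/raw/1.bam", "projB", "/archive/sample.bam")
platform.remove("projA", "/raw/1.bam")
print(platform.projects)
```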
DXPath on stor¶
Project is always required for DNAnexus instances:
>>> Path('dx://path/to/file')
Traceback (most recent call last):
...
<exception>
but projects are normalized:
>>> Path('dx://myproject')
DXVirtualPath('dx://myproject:')
Duplicate names on DNAnexus¶
A single virtual path can refer to multiple files (and even a folder) simultaneously! Currently, stor raises an error if a specific virtual path resolves to multiple files (use dx-toolkit in these cases), but you can always use a canonical path:
$ dx upload myfile.txt -o MyProject:/myfile.txt
$ dx upload anotherfile.txt -o MyProject:/myfile.txt
$ stor cat dx://MyProject:/myfile.txt
# MultipleObjectsSameNameError: Multiple objects found at path (dx://MyProject:/myfile.txt). Try using a canonical ID instead
When a folder has the same name as a file, stor uses the method you call to check for
a folder or a file (i.e., DXPath.listdir will assume folder, DXPath.stat will assume file).
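The duplicate-name rule can be sketched as a lookup that refuses to guess. Everything here is hypothetical: `resolve_unique` and `DuplicateNameError` are stand-ins written for this page (stor's own error is MultipleObjectsSameNameError), and the listing data is invented.

```python
class DuplicateNameError(Exception):
    """Stand-in for the error stor raises; not stor's own class."""


def resolve_unique(entries, virtual_path):
    """Return the single file ID stored at virtual_path, or raise."""
    matches = [file_id for path, file_id in entries if path == virtual_path]
    if not matches:
        raise LookupError("nothing found at %s" % virtual_path)
    if len(matches) > 1:
        raise DuplicateNameError(
            "Multiple objects found at path (%s). "
            "Try using a canonical ID instead" % virtual_path
        )
    return matches[0]


# Two uploads to the same virtual path produce two distinct file IDs.
entries = [
    ("/myfile.txt", "file-aaa"),
    ("/myfile.txt", "file-bbb"),
    ("/other.txt", "file-ccc"),
]

print(resolve_unique(entries, "/other.txt"))
try:
    resolve_unique(entries, "/myfile.txt")
except DuplicateNameError as exc:
    print(exc)
```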
DXPath¶
-
exception
stor.dx.
DNAnexusError
(message, caught_exception=None)[source]¶ Base class for all remote errors thrown by this DX module
-
class
stor.dx.
DXCanonicalPath
(pth)[source]¶ Represents fully canonicalized DNAnexus paths: ‘dx://project-{dxID}:/file-{dxID}’ or ‘dx://project-{dxID}:’
-
property
canonical_path
¶ Get DXCanonicalPath instance for path
-
property
canonical_project
¶ The canonical dxid for the project
-
property
canonical_resource
¶ The canonical dxID of the file resource
-
exists
()[source]¶ Checks existence of the path.
- Returns
True if the path exists, False otherwise.
- Return type
-
splitpath
()[source]¶ Wrapper around base splitpath function which calls splitpath on the normpath of self
-
virtual_path
¶ The DXVirtualPath instance equivalent to the canonical path within the specified project
-
property
virtual_project
¶ The virtual (human-readable) name of the project associated with this path
-
property
virtual_resource
¶ The virtual (human-readable) path of the resource associated with this path
-
property
-
class
stor.dx.
DXPath
(pth)[source]¶ Provides the ability to manipulate and access resources on DNAnexus servers with stor interfaces.
-
abspath
()¶ No-op for ‘abspath’
-
clear_cached_properties
()[source]¶ Clears all cached properties in DXPath objects.
The canonical and virtual forms of DXPath objects are cached to not hit the server for every transformation call. However, after copy/remove/rename, the cached information is outdated and needs to be cleared.
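Why cached values must be cleared after copy/remove/rename can be seen in a small model. This is a sketch using the standard library's `functools.cached_property`; whether stor uses that decorator is not stated in the source, and `CountingResolver` and `CachedPath` are hypothetical classes.

```python
from functools import cached_property


class CountingResolver:
    """Fake server lookup that counts how often it is called."""

    def __init__(self, table):
        self.table = table
        self.calls = 0

    def lookup(self, virtual):
        self.calls += 1
        return self.table[virtual]


class CachedPath:
    def __init__(self, virtual, resolver):
        self.virtual = virtual
        self._resolver = resolver

    @cached_property
    def canonical(self):
        # Computed once, then stored in the instance __dict__.
        return self._resolver.lookup(self.virtual)

    def clear_cached_properties(self):
        # Dropping the stored value forces a fresh lookup on next access.
        self.__dict__.pop("canonical", None)


resolver = CountingResolver({"/1.bam": "file-old"})
path = CachedPath("/1.bam", resolver)
first, second = path.canonical, path.canonical   # one lookup in total
resolver.table["/1.bam"] = "file-new"            # e.g. the file was replaced
stale = path.canonical                           # still the cached value
path.clear_cached_properties()
fresh = path.canonical                           # second lookup
print(first, stale, fresh, resolver.calls)
```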
-
property
content_type
¶ Get content type for DXObject. Returns empty string if not present or is project/
-
copy
(dest, raise_if_same_project=False, **kwargs)[source]¶ Copies data object to destination path.
If dest already exists as a directory on the DX platform, the file is copied underneath dest directory with original name.
If the target destination already exists as a file, it is first deleted before the copy is attempted.
For example, assume the following file hierarchy:
dxProject/
- a/
- - 1.txt
anotherDxProject/
Doing a copy of 1.txt to a new destination of b.txt is performed with:
Path('dx://dxProject:/a/1.txt').copy('dx://anotherDxProject:/b.txt')
The end result for anotherDxProject looks like:
anotherDxProject/
- b.txt
And, if the destination already exists as a directory, i.e. we have:
dxProject/
- a/
- - 1.txt
anotherDxProject/
- b.txt/
Performing copy with the following command:
Path('dx://dxProject:/a/1.txt').copy('dx://anotherDxProject:/b.txt')
Will yield the resulting structure to be:
anotherDxProject/
- b.txt/
- - 1.txt
If the source file and destination belong to the same project and the raise_if_same_project flag is False, the file is moved instead of copied, because the same underlying file cannot appear in two locations in the same project.
If the final destination for the file already is an existing file, that file is deleted before the file is copied.
- Parameters
dest (Path|str) – The destination file or directory.
raise_if_same_project (bool, default False) – Controls moving file within project instead of cloning. If True, raises an error to prevent this move. Only takes effect when both source and destination are within the same DX Project
- Raises
DNAnexusError – When copying within same project with raise_if_same_project=True
NotFoundError – When the source file path doesn’t exist
-
copytree
(dest, raise_if_same_project=False, **kwargs)[source]¶ Copies a source directory to a destination directory. This is not an atomic operation.
If the destination path already exists as a directory, the source tree including the root folder is copied over as a subfolder of the destination.
If the source and destination directories belong to the same project, the tree is moved instead of copied. Also, in such cases, the root folder of the project cannot be the source path. Please listdir the root folder and copy/copytree individual items if needed.
For example, assume the following file hierarchy:
project1/
- b/
- - 1.txt
project2/
Doing a copytree from project1:/b/ to a new dx destination of project2:/c is performed with:
Path('dx://project1:/b').copytree('dx://project2:/c')
The end result for project2 looks like:
project2/
- c/
- - 1.txt
If the destination path directory already exists, the folder is copied as a subfolder of the destination. If this new destination also exists, a TargetExistsError is raised.
If the source is a root folder, and is cloned to an existing destination directory or if the destination is also a root folder, the tree is moved under project name.
Refer to dx docs for detailed information.
- Parameters
dest (Path|str) – The directory to copy to. Must not exist if it’s a posix directory
raise_if_same_project (bool, default False) – Allows moving files within project instead of cloning. If True, raises an error to prevent moving the directory. Only takes effect when both source and destination directory are within the same DX Project
- Raises
DNAnexusError – Attempt to clone within same project and raise_if_same_project=True
TargetExistsError – All possible destinations for source directory already exist
NotFoundError – source directory path doesn’t exist
-
dirname
()[source]¶ Returns directory name of path. Returns self if path is a project.
To avoid making API calls, canonical paths will return the project ID
-
download
(dest, **kwargs)[source]¶ Download a directory.
- Parameters
dest (Path) – The output directory
- Raises
NotFoundError – When source or dest path is not a directory
-
download_object
(dest, **kwargs)[source]¶ Download a single path or object to file.
- Parameters
dest (Path) – The output file
- Raises
NotFoundError – When source path is not an existing file
-
download_objects
(dest, objects)[source]¶ Downloads a list of objects to a destination folder.
Note that this method takes a list of complete relative or absolute OBSPaths to objects (in contrast to taking a prefix). If any object does not exist, the call will fail with partially downloaded objects residing in the destination path.
- Parameters
dest (str) – The destination folder to download to. The directory will be created if it doesn’t exist.
objects (List[str|PosixPath|SwiftPath]) – The list of objects to download. The objects can be paths relative to the download path or absolute obs paths. Any absolute obs path must be children of the download path
- Returns
A mapping of all requested objs to their location on disk
- Return type
Examples
To download objects to a dest/folder destination:
from stor import Path
p = Path('dx://project:/dir/')
results = p.download_objects('dest/folder', ['subdir/f1.txt', 'subdir/f2.txt'])
print(results)
{
    'subdir/f1.txt': 'dest/folder/subdir/f1.txt',
    'subdir/f2.txt': 'dest/folder/subdir/f2.txt'
}
To download full obs paths relative to a download path:
from stor import Path
p = Path('dx://project:/dir/')
results = p.download_objects('dest/folder', [
    'dx://project:/dir/subdir/f1.txt',
    'dx://project:/dir/subdir/f2.txt'
])
print(results)
{
    'dx://project:/dir/subdir/f1.txt': 'dest/folder/subdir/f1.txt',
    'dx://project:/dir/subdir/f2.txt': 'dest/folder/subdir/f2.txt'
}
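The mapping returned in the two examples above follows one rule: each requested object, relative or absolute, ends up at its path relative to the download path, joined onto dest. The sketch below reproduces only that string logic; `plan_downloads` is a hypothetical helper and performs no downloads.

```python
import posixpath


def plan_downloads(base, dest, objects):
    """Map each requested object to its expected location on disk (hypothetical)."""
    base = base.rstrip("/") + "/"
    plan = {}
    for obj in objects:
        if obj.startswith("dx://"):
            # Absolute paths must be children of the download path.
            if not obj.startswith(base):
                raise ValueError("%s is not under %s" % (obj, base))
            relative = obj[len(base):]
        else:
            relative = obj
        plan[obj] = posixpath.join(dest, relative)
    return plan


print(plan_downloads("dx://project:/dir/", "dest/folder",
                     ["subdir/f1.txt", "dx://project:/dir/subdir/f2.txt"]))
```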
-
exists
()[source]¶ Checks whether path exists on local filesystem or on swift.
For directories on swift, checks whether directory sentinel exists or at least one subdirectory exists
-
expanduser
()¶ No-op for ‘expanduser’
-
glob
(pattern, condition=None, canonicalize=False)[source]¶ Glob for pattern relative to this directory.
-
isdir
()[source]¶ Determine if path is directory-like (i.e., it’s a project, or it’s a folder that can be listed)
- Returns
True if path is an existing folder path or project
- Return type
-
isfile
()[source]¶ Determine an object exists at the specified path
- Returns
True if path points to an existing file
- Return type
-
joinpath
(*others)[source]¶ Wrapper around base joinpath function which converts the first part to normpath before joining with others.
-
list
(canonicalize=False, starts_with=None, limit=None, classname=None, condition=None)[source]¶ List contents using the resource of the path as a prefix. This will only list the file resources (and not empty directories like other OBS).
Warning
Prefer list_iter() to this method in production code. If there are many files (i.e., more than 1-2K) to list, this method may take a long time to return and use a lot of memory to construct all of the objects.
Examples
>>> Path('dx://MyProject:/my/path/').list(canonicalize=False)
[Path('dx://MyProject:/my/path/to/file.txt'), ...]
>>> Path('dx://MyProject:/my/path/').list(canonicalize=True)
[Path('dx://project-123:file-123'), ...]
- Parameters
canonicalize (bool, default False) – if True, return canonical paths
starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory
limit (int) – Limit the amount of results returned
classname (str) – Restricting class : One of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’
condition (function(results) -> bool) – The method will only return when the results matches the condition.
- Returns
Iterates over listed files that match an optional pattern.
- Return type
List[DXPath]
-
list_iter
(canonicalize=False, starts_with=None, limit=None, classname=None)[source]¶ Iterable that yields objects under prefix (especially useful when a folder may have many small files)
Note that this is a wrapper function to walkfiles.
- Parameters
canonicalize (bool, default False) – if True, return canonical paths
starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory
limit (int) – Limit the amount of results returned
classname (str) – Restricting class : One of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’
- Returns
Iterates over listed files that match an optional pattern.
- Return type
Iterable[DXPath]
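The advantage of an iterator over a fully built list can be shown with a fake paginated listing. This is a generic sketch, not stor code: `iter_listing`, `make_fake_fetcher`, and the page-token protocol are all assumptions made for the illustration.

```python
from itertools import islice


def iter_listing(fetch_page):
    """Yield items page by page, requesting a page only when it is needed."""
    token = None
    while True:
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return


def make_fake_fetcher(names, page_size, log):
    """Build a fake page fetcher that records each request in log."""
    def fetch_page(token):
        start = token or 0
        log.append(start)
        end = start + page_size
        next_token = end if end < len(names) else None
        return names[start:end], next_token
    return fetch_page


names = ["file%d.txt" % i for i in range(10)]

lazy_log = []
first_three = list(islice(iter_listing(make_fake_fetcher(names, 4, lazy_log)), 3))

eager_log = []
everything = list(iter_listing(make_fake_fetcher(names, 4, eager_log)))

print(first_three, len(lazy_log), len(eager_log))
```

Taking only the first three results requests a single page, while materializing the whole listing requests every page and holds every object in memory.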
-
listdir
(only='all', canonicalize=False)[source]¶ List the path as a dir, returning top-level directories and files.
- Parameters
- Returns
Iterates over listed files directly within the resource
- Return type
List[DXPath]
- Raises
NotFoundError – When resource folder is not present on DX platform
-
listdir_iter
(canonicalize=False)[source]¶ Iterate the path as a dir, returning top-level directories and files.
-
makedirs_p
(mode=511)[source]¶ Make directories, including parents on DX from DX folder paths.
- Parameters
mode – unused, present for compatibility (access permissions are managed at project level)
-
property
name
¶ File or folder name of the path. Empty string for projects or folders with trailing slash.
Makes no API calls to server, canonical paths are treated normally, and the basename of the path is returned.
-
open
(mode='r', encoding=None)[source]¶ Opens an OBSFile that can be read or written to and is uploaded to the remote service.
For examples of reading and writing opened objects, view OBSFile.
- Parameters
- Returns
The file object for Swift/S3/DX.
- Return type
OBSFile
- Raises
ValueError – if attempting to write to project
DNAnexusError – A dxpy client error occurred.
-
property
project
¶ The project name from the path or None
-
read_object
()[source]¶ Reads an individual object from DX. Note dxpy for Py3 automatically decodes the DXFile.read using utf-8.
- Returns
the raw bytes from the object on DX.
- Return type
-
realpath
()¶ No-op for ‘realpath’
-
remove
()[source]¶ Removes a single object from DX platform
- Raises
ValueError – The path is invalid.
-
property
resource
¶ The virtual or canonical path to the file within the project (as a POSIXPath).
Examples
>>> Path('dx://project:dir/file').resource
PosixPath('dir/file')
>>> Path('dx://project-123:file-456').resource
PosixPath('file-456')
NOTE: to avoid making API requests, this operation only uses the local string
-
rmtree
()[source]¶ Removes a resource and all of its contents. The path should point to a project or directory.
- Raises
NotFoundError – The path points to a nonexistent directory
-
stat
()[source]¶ Performs a stat on the path. This method follows the (slightly vague) behavior of dxpy’s describe method. It works as expected for a virtual path. However, for a canonical path such as:
Path('dx://project-123:/file-123')
if project-123 exists and file-123 exists, but file-123 doesn’t exist inside project-123, stat will still return the describe response on file-123 (with its default project).
Use stor.exists to check if a canonical path actually exists.
- Raises
MultipleObjectsSameNameError – If project or resource is not unique
NotFoundError – When the project or resource cannot be found
ValueError – If path is folder path
-
temp_url
(lifetime=300, filename=None)[source]¶ Obtains a temporary URL to a DNAnexus data-object.
If DX_FILE_PROXY_URL or [dx] file_proxy_url= is set, will use that to construct a path instead, e.g.:
>>> stor.Path('dx://proj:/folder/mypath.csv').temp_url()
'https://dl.dnanex.us/F/D/awe1323/mypath.csv'
>>> with stor.settings.use({'dx': {'file_proxy_url':
...         'https://my-dnax-proxy.example.com/gateway'}}):
...     stor.Path('dx://proj:/folder/mypath.csv').temp_url()
'https://my-dnax-proxy.example.com/gateway/proj/folder/mypath.csv'
The file proxy is assumed to be a service that, when given DX path and project, will proxy through to DNAnexus to render content.
- Parameters
- Raises
ValueError – The path points to a project
ValueError – file_proxy_url is set and filename does not match object name
ValueError – file_proxy_url does not look like a valid http(s) path
NotFoundError – The path could not be resolved to a file (when file_proxy_url unset)
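The proxied URL in the example above appears to be the proxy base followed by the project and the resource path. The sketch below reproduces that string construction and the http(s) check; `build_proxy_url` is a hypothetical helper inferred from the documented output, not stor's code.

```python
from urllib.parse import urlparse


def build_proxy_url(file_proxy_url, project, resource):
    """Join proxy base, project and resource into one URL (hypothetical)."""
    parsed = urlparse(file_proxy_url)
    # Mirrors the documented ValueError for a non-http(s) proxy setting.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError("file_proxy_url does not look like an http(s) URL")
    return "/".join([file_proxy_url.rstrip("/"), project, resource.lstrip("/")])


print(build_proxy_url("https://my-dnax-proxy.example.com/gateway",
                      "proj", "/folder/mypath.csv"))
```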
-
upload
(to_upload, **kwargs)[source]¶ Upload a list of files and directories to a directory.
This is not a batch level operation. If some file errors, the files uploaded before will remain present.
- Parameters
to_upload (List[Union[str, OBSUploadObject]]) – A list of posix file names, directory names, or OBSUploadObject objects to upload.
- Raises
ValueError – When source path is not a directory
TargetExistsError – When destination directory already exists
-
walkfiles
(pattern=None, canonicalize=False, recurse=True, starts_with=None, limit=None, classname=None)[source]¶ Iterates over listed files that match an optional pattern.
- Parameters
pattern (str) – glob pattern to match the filenames against.
canonicalize (bool, default False) – if True, return canonical paths
recurse (bool, default True) – if True, look in subfolders of folder as well
starts_with (str) – Allows for an additional search path to be appended to the resource of the dx path. Note that this resource path is treated as a directory
limit (int) – Limit the amount of results returned
classname (str) – Restricting class : One of ‘record’, ‘file’, ‘gtable’, ‘applet’, ‘workflow’
- Returns
Iterates over listed files that match an optional pattern.
- Return type
Iter[DXPath]
-
write_object
(content, **kwargs)[source]¶ Writes an individual object to DX.
Note that this method writes the provided content to a temporary file before uploading. This allows us to reuse code from DXPath’s uploader (multi part object support, etc.).
- Parameters
content (bytes) – raw bytes to write to OBS
**kwargs – Keyword arguments to pass to DXPath.upload
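The write-to-temporary-file-then-upload pattern described for write_object can be sketched with the standard library. This is an assumption-laden illustration: `write_via_tempfile` is hypothetical, and `upload_file` stands in for whatever uploader receives the temporary file's path.

```python
import os
import tempfile


def write_via_tempfile(content, upload_file):
    """Write raw bytes to a temporary file, hand its path to an uploader, clean up."""
    fd, tmp_path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        # The uploader only ever sees a regular file on disk, so the same
        # upload code path can serve both files and in-memory content.
        return upload_file(tmp_path)
    finally:
        os.remove(tmp_path)


uploaded = []


def fake_upload(path):
    """Stand-in uploader that records the file content it was given."""
    with open(path, "rb") as handle:
        uploaded.append(handle.read())
    return path


used_path = write_via_tempfile(b"hello dx", fake_upload)
print(uploaded, os.path.exists(used_path))
```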
-
-
class
stor.dx.
DXVirtualPath
(pth)[source]¶ Class Handler for DXPath of form ‘dx://MyProject:/a/b/c’ or ‘dx://project-{uuid}:/b/c’
-
property
canonical_path
¶ The unique file or project that matches the given path
-
canonical_project
¶ The dxid of the unique project for the given project name. Only resolves project user has access to.
- Raises
MultipleObjectsSameNameError – If project name is not unique on DX platform
NotFoundError – If project name doesn’t exist on DNAnexus
-
canonical_resource
¶ The dxid of the file at this path
- Raises
MultipleObjectsSameNameError – if filename is not unique
NotFoundError – if resource is not found on DX platform
ValueError – if path looks like a folder path (i.e., ends with trailing slash)
-
exists
()[source]¶ Checks existence of the path.
- Returns
True if the path exists, False otherwise.
- Return type
-
splitpath
()[source]¶ Wrapper around base splitpath function which calls splitpath on the normpath of self
-
property
virtual_path
¶ Path as DXVirtualPath
-
virtual_project
¶ Returns the virtual name of the project associated with the DXVirtualPath
-
property
virtual_resource
¶ Human-readable path to the object in its DNAnexus Project (as PosixPath)
-
property
-
exception
stor.dx.
InconsistentUploadDownloadError
(message, caught_exception=None)[source]¶ Thrown during checksum mismatch or part length mismatch.
Copy and copytree by example¶
copy and copytree behave differently when the target output path exists and is a folder. (This holds for DX -> DX and also for POSIX -> DX.)
command | output path (no folder) | output path (folder exists) |
---|---|---|
stor cp myfile.vcf dx://project2:/call | dx://project2:/call | dx://project2:/call/myfile.vcf |
stor cp -r ./trip-photos dx://newproject:/all | dx://newproject:/all | dx://newproject:/all/trip-photos |
Note that if the output path exists and is a file, the file will be overwritten
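The destination rules for copy and copytree can be written as two small functions over path strings. This is a sketch of the documented behavior only: both function names are hypothetical, `existing_folders` is a plain set standing in for the platform's state, and the built-in FileExistsError stands in for stor's TargetExistsError.

```python
import posixpath


def resolve_copy_target(source_name, dest, existing_folders):
    """Where a single file lands: inside dest if dest is an existing folder."""
    dest = dest.rstrip("/")
    if dest in existing_folders:
        return posixpath.join(dest, source_name)
    return dest


def resolve_copytree_target(source_name, dest, existing_folders):
    """Where a folder lands: dest itself, else a subfolder of dest, else an error."""
    dest = dest.rstrip("/")
    if dest not in existing_folders:
        return dest
    nested = posixpath.join(dest, source_name)
    if nested in existing_folders:
        # Stand-in for TargetExistsError: every candidate destination is taken.
        raise FileExistsError(nested)
    return nested


print(resolve_copy_target("myfile.vcf", "dx://project2:/call", set()))
print(resolve_copy_target("myfile.vcf", "dx://project2:/call", {"dx://project2:/call"}))
print(resolve_copytree_target("trip-photos", "dx://newproject:/all", {"dx://newproject:/all"}))
```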
List, Listdir and walkfiles¶
stor ls, stor list and stor walkfiles for DXPaths take in a --canonicalize
flag which returns the results with canonical dxIDs instead of human-readable virtual
paths. This is especially useful for manipulating paths directly using dx-toolkit
through piping. This flag is ignored when passed for other paths (swift/s3/posix).
Open on stor¶
The open functionality in dx works by returning an instance of stor.obs.OBSFile,
as with other OBS paths (Swift/S3). Although dxpy, the DNAnexus Python package,
also has an open functionality on its DXFile, this is not carried over to stor.
One of the main reasons is to limit the scope of DXFile.open to what is expected
of stor. As an example, dxpy’s version of DXFile.open does not have readline and
readlines methods for reading the file. On the other hand, dxpy does support an
‘append’ mode in its DXFile.open, which can be confusing to a stor user, because
it can be used only in very restricted scenarios, and the user would have to
know the different internal states of a file on the DNAnexus platform,
what they mean, when they happen, what operations are allowed on them, etc.
By instantiating stor.obs.OBSFile for open, we maintain the
support that is standard in stor, without any real decrease in functionality.
The dx setting ‘wait_on_close’ has a default value of 0 (seconds). It determines
how long a stor.open action waits for the file to reach the ‘closed’ state on DNAnexus.
A file that is not in the ‘closed’ state internally on the platform cannot be read from or
downloaded. If you need consistency when reading right after writing, set
wait_on_close to a value > 0. The default is 0 so that, in the event of multiple uploads,
each upload doesn’t wait for wait_on_close seconds before the next upload starts. However,
setting wait_on_close > 0 can cause unexpected performance issues, depending on the performance of
the DNAnexus platform, so change it only if you understand this trade-off.
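The trade-off behind wait_on_close can be illustrated with a generic polling loop. This is a sketch, not stor's implementation: `wait_until_closed`, its parameters, and the state names other than ‘closed’ are assumptions, and `get_state` stands in for a call that asks the platform for the file's current state.

```python
import time


def wait_until_closed(get_state, wait_on_close, poll_interval=0.01,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the file reports 'closed' or wait_on_close seconds elapse.

    Returns True if the file closed in time, False otherwise.
    """
    deadline = clock() + wait_on_close
    while True:
        if get_state() == "closed":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


# With a positive wait, the caller blocks until the state changes.
states = iter(["open", "closing", "closed"])
closed_in_time = wait_until_closed(lambda: next(states), wait_on_close=5, poll_interval=0)

# With the default of 0, the state is checked once and the call returns at once.
no_wait = wait_until_closed(lambda: "closing", wait_on_close=0)

print(closed_in_time, no_wait)
```

A wait of 0 keeps back-to-back uploads fast but gives no guarantee that the file is readable immediately afterwards; a positive wait trades upload throughput for that guarantee.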