Release notes (Last modified jmc:2005-07-29).
Note for SCOP 1.69: A small number of PDB entries with 'official' release dates towards the end of September are not included because they were only released in the first weekly PDB update of October 2004. These entries are 1up9, 1upd, 1urm, 1w0c, 1w0h, 1w1w, 1w28, 1w2a, 1w2n, 1w2o, 1w2p, 1w2v, 1w30, 1w32, 1w3h, 1w7o, 1w7p, 1w8l, and 1w8m.
These notes mainly cover new features introduced in SCOP 1.55 and subsequent releases. They are organized by topics for easy reference, and include detailed explanations and examples. Although release 1.55 and following ones are superficially similar to previous releases, there are several innovations that may affect the way you use SCOP [4]. The SCOP history, originally designed to document fully all modifications of the classification, has been extended to provide an easy way to browse the subset of the classification affected by revisions occurred since 1.55 [4], including major rearrangements of the classification that have been carried out in release 1.63 and 1.65. Please cite [6] for the SCOP reclassification itself.
Please read carefully and don't hesitate to get in touch with us, should you have any further question. Your comments and suggestions are more than welcome.
The
SCOP (Structural Classification of Protein) database
is a comprehensive ordering of all proteins of known structure according to
their evolutionary, functional and structural relationships.
The basic classification unit is the protein domain.
Domains are
hierarchically classified into
species, proteins,
families, superfamilies, folds,
and classes whose meaning is described in the
original articles [1,2,3].
As an example, see
the classification of the bacteriophage 434 protein,
PDB entry
2cro.
The set of buttons at the top of each web page
provides help and an easy way of navigating through the hierarchy.
Protein domains can
be visualized using either RASMOL
Note that only the
first seven classes
are true classes. The remaining ones are place holders
for PDB entries that is useful to keep together,
but are not part of the SCOP classification.
Also, one of the classes
includes multi-domain proteins
only,
that is, proteins not yet split into domains and such that
folds of the individual domains belong to more than one class. Multi-domain
proteins with individual domain folds belonging to the same class can be
found under the corresponding fold and class, mixed with single domain
proteins.
In computational terms, although SCOP is essentially
a hierarchy,
a superposed
mechanism for cross-linking between nodes of the tree makes it a more
general form of graph.
This allows the representation of biological relationships more complex
than the
parent-child relationships in a tree.
At the moment, this cross-linking
ability is only used to group together domains
belonging to
the same PDB
file but classified separately
(see the
Moreover, since the original design and
implementation of SCOP [3]
is based on a description of the underlying data structure rather than
the data structure itself, it would be easy to introduce new classification levels
everywhere, such a suprafamily between family and superfamily,
for example. Once the description is modified accordingly,
the rest of it will fall into place automatically. Extensions of
SCOP were designed with
this possibility in mind.
Keeping in mind the underlying, deep structure of SCOP
while reading these notes may help you understand
its surface representation both in terms of web pages and
parseable files,
as well as possible future extensions.
Access to the knowledge encoded in SCOP
has always been a major concern.
That's why all the information is also
available in compact text files
that are easy to parse,
making large scale analysis of the SCOP content
straightforward. A set of
unique identifiers associated to
each SCOP entry is used both for
web pages and as keys in these
parseable files.
Unique identifiers remain stable across releases and provide an
unambigous way of
searching SCOP,
linking to and
from SCOP, tracking
the history of any node in the hierarchy, and
referring to SCOP entries in
related
research and in the literature.
Stable
identifiers, parseable files, extended linking and searching options,
and the concept of a history tracking the evolution
of the SCOP classification itself were
introduced in release 1.55 and have been developed since then
to accommodate structural genomics requirements as well as
other usages of SCOP for both research and educational
purposes [4].
SCOP genetic domain sequences and coordinates,
mapping between sequences from SEQRES and ATOM records
in PDB files and other
SCOP
related data can be downloaded from
the associated
ASTRAL
web site and were first described in [5].
SCOP is undergoing continual evolution.
Older
SCOP releases remain accessible, allowing
redirection for entries that become obsolete over time. In this way,
no information is ever lost and it is always possible
to go back to the original data
used in SCOP related research even after domain
definitions or the classification itself has changed.
Major rearrangements of the classification have actually
occurred in release 1.63 and 1.65 and more are expected in the near
future. So far, the rearrangement involved about 11% of the domain entries,
as described in [6].
The subset of the classification affected by these changes
can be browsed in a
SCOP-like form
or
CHIME
.
Click on the
icon for a set of links to
other sites containing useful information related to the
protein domain. Use the entry at the bottom of each page
to search the SCOP
database.
icon),
or to indicate connections between
SCOP entries,
as in the case of
the circular
permutation that is likely to relate the threonyl-tRNA
synthetase superfamily and LuxS family. However, the same
mechanism can be easily extended to add
new, possibly sparse, intermediate levels in the hierarchy.
Starting with release 1.55,
there are two series of new identifiers, SCOP
unique identifiers (sunid) and
SCOP concise classification strings (sccs).
They are stable across releases,
with the caveat that
a domain definition or the classification itself can change.
Nodes in the hierarchy can merge or split as more evidence about
evolutionary relationships becomes
available from experimental data. Corresponding identifiers
become obsolete. No identifier is ever reused, and all modifications
of the classification are documented and easily tracked
(see SCOP history).
Therefore, stable means that the same identifier will always
refer to the same object in SCOP, but it does
not mean that the SCOP entries themselves will
not change.
A sunid
is simply a number which uniquely identifies each entry in
the SCOP hierarchy, including leaves (i.e., the
SCOP domains) and entries corresponding to the
protein level (for which there was no explicit reference before).
A sccs is a compact representation of
a SCOP domain classification.
A sccs identifier
includes only the class (alphabetical), fold,
superfamily,
and family (all numerical) to which each domain belongs to.
Together, sunid and sccs replace the old
classification page numbers (like 1.002.044.001.002.021).
All these page names have been renamed in SCOP 1.55.
If you still link to SCOP using a page name,
despite this having been deprecated since 1994, all
your links are broken. This eliminates the even worse
situation of misleading links where a reference using a classification
page number points to a completely different entry as soon as
SCOP is
updated because most of these page names change with every
new release. (For those interested, page names are generated via
enumeration of the SCOP tree and therefore
change as soon as new nodes are inserted, and old nodes are deleted, merged,
split or moved to a different location in the tree.)
sunid and, when applicable, sccs, can be visualized
by placing the mouse on the appropriate link (i.e., either
in the lineage or in the PDB entry section
in a SCOP web page). Look at the bar at the
bottom of your browser.
Starting with release 1.65, sunid also appear explictly in the
web pages.
Not only proteins evolve. Their classification evolves as well, as more evidence
about evolutionary relationships become available from experimental data, or
simply as the result of refinements, rethinking and bug fixing. This process
is a natural one and it would be a mistake to freeze the
SCOP
content, as it is a mistake to freeze any form of knowledge and pretend it
to be absolute and unquestionable.
On the other hand, data accuracy and data reliability are a must, especially for well
established collections like SCOP, which are widely
used to build upon and derive further research results.
Dynamic databases, and the best way of maintaining a history of
their content are open research issues in the database field. In science,
it is a basic tenet that results should be reproducible, which implies that the
original data should remain accessible.
These issues are addressed in SCOP at several levels
and along different dimensions in order to satisfy different needs.
First of all,
old
SCOP releases remain available
online in their entirety, in their original form and content, possibly with
some additions that are backpropagated once introduced in
more recent releases, like in
the case of the
reclassified entries history, but without deletions or changes.
This is
possible without exponentially growing resources thanks to a minimalist
design that results in maximum information content with minimum waste
of space.
Moreover, SCOP parseable
files, also available
online for current and previous releases,
provide a very compact summary of all the information encoded in
SCOP.
Because old releases are available on the web, they can be searched in the
same way as if they were the current release. Also, any reference in
any release to an
identifier that becomes obsolete for various reasons at a certain point
in time remains valid: the search engine
returns a pointer to the page(s) in the old release where it last
appeared, together with links to what it became. See, for example,
what happens when searching for 46781, the
TnsA endonuclease, C-terminal domain
superfamily in SCOP 1.55. This
historical information is not only accessible interactively and for single entries,
but it
is also
summarized in a list that can be
downloaded from the SCOP web site.
In order to talk about what an entry was and what it becomes, there must
be a unique, unambiguous way of referring to it. This function is fulfilled
by sunid, the SCOP unique
identifiers introduced in release 1.55 for this and other
purposes. Nothing else would possibly work, as names in
SCOP are not unique and may change even
if the corresponding node in the hierarchy remains the same.
As emphasized more
than once in these notes, sunid are kept stable
across releases. This means that if a node does not change, its sunid
remains the same but, if the node does change, then its sunid changes,
and the way in which it changes is explicitly documented.
Note that additions of new entries and deletion of obsolete ones do not
constitute a problem. The former will get a new sunid, the sunid
associated with the latter will simply become obsolete. Problems arise when an old
entry in SCOP is modified without becoming obsolete, and
new entries appear not because of the classification of new structures,
but as a consequence of splitting or merging nodes that are already part
of the hierarchy.
A typical example is when a domain in a multidomain protein already classified in
SCOP is observed for the first time by itself, or in a different
context, and therefore qualifies as a
separate domain. The newly split domains can be classified in different parts of
the hierarchy and, in principle, all their attributes,
from species to class, may change. A more apparent case is when a full class,
in SCOP for convenience, but marked as Not a true
class is rearranged and becomes part of the classification, like the
Membrane and cell surface proteins and peptides
class, in which a single fold was replaced
by 20 or more new folds in release 1.63 and several other folds were
amply restructured.
Minor rearrangements of the classification are common. They occur at any new release
and are documented through lists and redirections, as explained before. However,
in SCOP 1.63 and 1.65, about 11% of the total number of
already classified domains changed in terms of their definition, their
classification, or both. These changes are described in
[6]. More are expected in future releases.
Therefore,
we introduced a SCOP-like way of browsing the subset
of the classification that is modified in between releases, to help users
analyze the effects of this reclassification process in a familiar way
[4].
This is now
available online
for all releases, starting with 1.55.
When fully expanded, it gives a detailed overview of all entries in the
classification affected by the revision. See the
changes that occurred between release 1.55 and 1.57, for example.
In the reclassified entries history,
changes appear as comments
associated to domain entries, with links to sunid
in the corresponding revised classification, so that both the previous
classification context for an entry and
the new one can be easily browsed at the same time.
Those SCOP features
that are meaningful for the reclassified entries history are available.
For example, you can expand and compress the history using the navigation
buttons at the top of the page, look at protein domain structures using
RASMOL
There are two ways of searching SCOP: a simple
keyword search
using the search entry at the bottom of any of the
web pages, and a more sophisticated way using the
search engine
from the location bar.
The latter is at the same time a way of linking
to SCOP.
A separate search is also possible for
reclassified entries
that you can browse in a SCOP-like form.
Most of the options described
below are available in that context as well, but search results are restricted to
the subset of entries for which the domain definition
changed or that have been
reclassified. See SCOP history for
more details.
Starting with release 1.65, there is a new option (see the
search engine) to use
MSDlite to search text fields in
PDB files and return links to the corresponding
SCOP entries (2004-04-07).
Any of the SCOP identifiers
(sunid,
sid,
sccs)
can be used in the keyword search.
sunid returns the corresponding page in the SCOP
hierarchy for any node in the tree. Being the primary key for both
web pages and SCOP
parseable files, sunid are
by far the fastest way of retrieving a SCOP entry or linking
to it. They are also stable across releases.
Starting with
release 1.65, they appear explicitly in the web pages (in square brackets),
next to the entry that they
encode.
Please use
a sunid to search or link to
SCOP any time you can. They are the most reliable
and general way of referring to any entry in SCOP.
sid are maintained for backward compatibility. Searching using a sid
as key returns the page corresponding to that domain.
sccs can be right truncated at any
level:
searching with a.1.1.1 returns the page corresponding to the
truncated hemoglobin family,
searching with a.1.1 returns the page corresponding to the
globin-like superfamily,
searching with a.1 returns the page corresponding to the
globin-like fold, and
searching with a as key returns the page corresponding to the
all alpha proteins
class.
PDB identifiers can also be used as keywords
to search SCOP. If a PDB
file content is split into several SCOP domains,
a list of links to all domains is returned, as in the case of
1dan. If a
PDB file becomes obsolete and is superseded
by another one, the search engine provides a pointer to the latter.
See the results of searching for 1hs0.
You can also use E.C. numbers as keywords.
For example,
3.1.21.4
returns the list of restriction endonucleases enzymes
classified in SCOP for which a mapping to
E.C. numbers can be derived from PDB files.
A more sofisticated mapping, on a per-residue basis,
is also available for individual domains (see
linking from SCOP).
Finally, you can search using any (at least two character long) word
that appears in any of the SCOP
pages. fivefold, for example, returns the page corresponding
to the five-fold pentein fold. Note that fivefold is the
spelling of the compound adjective in the SCOP
comment, but words don't necessarily need to be complete.
They can be right truncated by appending a + at the end,
like in hypoth+
Two or more words can be combined using
+(and) and -(and not) word-prefix
operators. Try structural +genomic+. This returns the list
of pages in which both structural and (any completion of)
genomic appear
in comments. Conversely, structural -genomic+ returns
the list of pages containing structural, but not (any completion
of) the word genomic.
Right truncation allows you to enter fewer characters per search.
On the other hand, operators can be used to focus your search:
yeast +saccharom+ returns the list of yeast proteins for
Saccharomyces cerevisiae only, while yeast +saccharom+ +elongation
returns a list further restricted to elongation factor domains from
saccharomyces cerevisiae. Compare with the results of searching
for yeast only.
Useful keywords are
newfa,
newsf,
newcf,
newcl, first
introduced in release 1.59.
They return, respectively, the list of new families, superfamilies,
folds, or classes for the current SCOP release and are part
of the history facility. New here refers either to
a new node
in SCOP that did not exist in previous releases, or
a node whose content has been modified not simply by addition of new entries
or removal of obsolete ones,
and therefore has acquired a new meaning
(see SCOP history).
Note that keyword search is case-insensitive. This implies that
PDB chain identifiers are treated uniformly,
without distinction between upper- and
lower-case. Very few PDB
files that include both lower- and uppercase chain identifiers
cannot be accommodated in the SCOP classification.
Names and comments in
SCOP are a sort of primitive form of annotation and
help produce biologically meaningful search results. However, please keep
in mind that names are
not unique, even within the same release.
It is not uncommon for the same name to be used for, say,
both the family and superfamily of a set of domains. Also, names are
reused within the same level of the hierarchy, expecially for proteins.
Different proteins, classified in different ways, may nevertheless
be called
in the same way.
All keyword searches are local
to the current release.
Previous SCOP
releases can be searched using the search engine for the corresponding
old release
that we keep
online as part of the SCOP
history facility. In some cases, you can also use the search engine
for the current release to search an old one (see
Advanced search options).
Moreover, a search for an identifier
that is now obsolete automatically refers you both to the release
in which it was last used,
and to the new identifer(s) that replaced it. Follow the fate
of a.4.4, for example.
Search options are also available for the
reclassified entries history, though the search
itself is limited to the history content
and keywords include SCOP sunid,
PDB identifiers, and words that appear in web history
pages only.
Starting with release 1.65, if the search for a PDB
identifier fails because it is not yet classified in
SCOP, there is an option to use
SSM
to match the PDB file against the
set of SCOP domains. See what happens if you
search for
1f2y.
Information in
SCOP
is interactively accessible
through a search engine
available at
http://scop.mrc-lmb.cam.ac.uk/scop
or any of the several mirrors scattered
around the world.
In the following explanation and examples,
simply replace scop.mrc-lmb.cam.ac.uk/scop
with your favourite SCOP server.
Please also read the
previous keyword search section,
as most of what is described there
applies here as well and will not be repeated.
By explicitly using the search engine from the location bar,
you can perform most of the keyword search operations
and obtain the same results. Moreover, you can specify options that
modify these results.
Most of the examples in this section are also
a valid way of linking to SCOP.
The general way to search or link to any of the SCOP
entries is by using the corresponding sunid:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=sunid
returns the corresponding page, or a page indicating that
sunid is obsolete, with links to the replacing one(s).
Try:
You can also search or link to SCOP using a (possibly
right-truncated) sccs, as in:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sccs=a.1.1.1
Searching and linking to scop using a
sid is still accepted for backward compatibility:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sid=d2cro__
Searching using a PDB identifier returns
the corresponding page:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?pdb=2cro
You can also search SCOP using E.C.
numbers:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?ec=3.1.21.4
It is possible to pass parameters to the search
engine. For example, you can specify for which release
you would like the list of, say, new families introduced in that
release. Default is the current release. Compare the
results of:
Also useful is the lev option. This allows you to specify
the level in the SCOP hierarchy in which you are
interested, independently of the level of your initial query. For example,
you can search or link to the superfamily page for a given domain by
specifying lev=sf in your query:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sid=d2cro__;lev=sf.
By adding a lev option to a search using
E.C. numbers, like in:
Valid levels are cl (class), cf (fold),
sf (superfamily), fa (family), dm (protein),
sp (species), and px (domain).
Finally, you can use the search engine to retrieve the
SCOP parseable
files (for current release only):
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=yyy
where
yyy is either cla, des, hie,
com (or, for old releases, dom).
All parseable files, for current and previous releases, are
available online
and can be retrieved using:
http://scop.mrc-lmb.cam.ac.uk/parse/filename_x.xx
where filename stands for
dir.cla.scop.txt, or
dir.des.scop.txt, or
dir.hie.scop.txt, or
dir.com.scop.txt, and
x.xx is the release number (e.g., 1.65).
The general way to link to any of the SCOP
entries is by using the corresponding sunid:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=sunid
returns the corresponding page, or a page indicating that
sunid is obsolete, with links to the replacing one(s).
Try:
Please use the sunid any time you can and this is appropriate for
your purposes. sunid are primary keys in the SCOP
database and are kept
stable across releases. Therefore, they are the fastest
and most reliable way of referring to a SCOP entry
at any level of the hierarchy, from leaves (e.g., domains) to classes.
All ways described in the advanced search options
section are also valid ways of linking to different kinds of information
available in SCOP, and allowing retrieval of this information.
Please keep in mind that in some cases
the link result may not be a specific web page, but a page of links to several
web pages, or a warning page telling you that an entry is obsolete and
redirecting you to its replacement(s).
The
See d1ckqa_ [sunid=33229],
for example. Links are logically arranged, starting from sequence and coordinate
data for the domain and expanding to genomic assignments and search for similar
structures both in SCOP and the up-to-date
PDB, which includes files not yet classified.
Links also point to structural characterization of the domain, as well as
further information about sequence and function, when available.
Some of the related information is actually at the level
of PDB file rather than the
SCOP domain. It is grouped separately, and provides useful
summaries from well established web sites.
Most of these sites are synchronously updated with SCOP,
shortly after any new release. Some other sites are updated more frequently
than SCOP. Sites updated asynchrously are grouped
together at the end of the list. Links to these sites appear only if
there is a corresponding entry at the other end, to avoid useless and therefore
disappointing trips over the net.
All these links can be dynamically updated between SCOP
releases, as the information becomes available.
Since a link has two ends by definition,
none of these links would be possible
without the much appreciated
collaboration of the people who are taking care of these sites and
provide curated mapping between their data and SCOP.
For each of the links, credit is given by providing a pointer to the original sources.
The SCOP web site provides an easy way of interactively
exploring the classification. However, this is not the most useful way of
processing the SCOP content for large scale analysis
projects, or when you want to use a computer program instead of your eyes
and brains.
Access to the knowledge encoded in SCOP
has always been a major concern. That's why this knowledge is also
available in compact text files that are easy to parse.
Starting with release 1.55,
there are three new parseable files. Together, they replace and
extend the now obsolete
dir.dom.scop.txt and dir.lin.scop.txt.
Each of these files has a header, starting with the '#' character,
which includes release, version, and copyright information. These
files have been designed in such a way that the likely inclusion
of new levels in the current SCOP hierarchy
will not break your code, provided you parse them correctly,
without making any assumption about the order in which levels appear
in the description. In the example below,
don't expect fa to come next to sf, because we
may insert a new "suprafamily" level in SCOP in
the near future, or some other levels that may help to
model the evolution of proteins better.
- dir.cla.scop.txt
Note that begin-end can be missing, like in d1danh_, meaning that the domain
corresponds to the whole chain. Also, some domains are composed of fragments
from the same or from different chains, like d1lvk_2 and d1dan.1.
In this case, fragments
are separated by commas, and appear in the domain definition in the same
order as in the original gene sequence,
which is not necessarily the same as the order of chains in
the corresponding PDB file. d1mbd__ and d1lvk_2
illustrate the case of a 'blank' PDB chain identifiers,
the former spanning the whole chain, the latter being composed of two
non-contiguous fragments from the same chain. Circular permutations are
not uncommon and generally indicated as such in comments.
For details about chain and residue identifiers,
please refers to the documentation
for the PDB
format available online at:
http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html
The fourth column in dir.cla.scop.txt
is the sccs,
the fifth column the sunid (px) for that domain,
listed separately to ease parsing;
the sixth column is the list of sunid for the domain:
class
(cl), fold (cf), superfamily
(sf), family (fa), protein domain (dm),
species (sp), and domain entry (px). They appear
as pairs (key=sunid) so that order does not need to be taken into account
in parsing the file and therefore
intermediate levels can be added in the future without breaking
code written now.
In all parseable files columns are tab-delimited.
- dir.des.scop.txt
Each entry corresponds to a node in the SCOP
hierarchy. The first
field is the sunid.
The second field is the entry type.
The third
is the sccs
(for domain entries, species and proteins, it is the
sccs for the corresponding family level). The fourth field is
a "short name" for that entry (the scop identifier, or
sid,
in the case of domains, currently an empty place-holder
for other types).
The last
field is the English description for that entry, the same one which
appeared in the now obsolete dir.lin.scop.txt file, with
the exception of domain entries, in which case it is the domain
definition.
Note that names are not unique. sunid are unique. They are the only
way of referring to an entry in SCOP without
ambiguity.
Entries in dir.des.scop.txt and other parseable files
are listed in the same order in which they appear in the SCOP
classification. This is equivalent to explore the classification tree
in depth-first order, always following the leftmost node first.
The order of the entries is important, as often
closely related entries appear next to each other in
SCOP. This useful information is preserved in all
parseable files.
At the domain level only, entries are sorted according to the quality of
the structure, as indicated by the corresponding
AEROSPACI
score. Therefore, the first entry in a list of domains for any given species
in SCOP is the best molecule for that species
in terms of structure quality.
By itself, dir.des.scop.txt contains all the information
about a given SCOP release. It is the most
compact way of representing SCOP and its tree-like
structure allows for very efficient usage and search.
- dir.hie.scop.txt
This file has no precursor in SCOP before release
1.55. It represents
the SCOP hierarchy in terms of
sunid.
The first field is the sunid for a node, the second the
sunid of its parent, and the third the list of sunid
of its children. "0" is the root, which has no parent (-). "21953" is
a scop domain (a leaf) and therefore has no children. "49267" is
a protein-type sunid, it has a parent, and two children.
As for all parseable files, if you properly
parse dir.hie.scop.txt without
making assumptions about the parent of a fa-type
sunid be a sf-type sunid, your code will not
break, should we decide to introduce new intermediate
levels in SCOP.
- dir.com.scop.txt
Starting with release 1.63, we also distribute
comments as they appear in the SCOP pages. Some of them
are automatically generated, some are manually edited. Comments associated
with a given
All parseable files for current and previous releases are available online
at:
http://scop.mrc-lmb.cam.ac.uk/parse
and can be retrieved at:
http://scop.mrc-lmb.cam.ac.uk/parse/filename_x.xx
or using the
SCOP search engine
(for current release only):
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?dir=yyy
filename is the name of the parseable file,
x.xx the release number (e.g., 1.65) and
yyy is either cla, des, hie,
com (or, for old releases, dom).
Relation between old and new identifiers
The old classification page numbers
(like 1.002.044.001.002.021) appear nowhere in recent
SCOP releases.
Within
each SCOP release, there is a one-to-one
correspondence between sid
(like d1dan.1) and
px-type sunid (like 21953).
This one-to-one correspondence is not necessarily maintained across
releases.
While we
can keep the sunid (and sccs) stable
in the sense
defined above, we don't have a direct control over a sid,
which is formed using a PDB filename and a
PDB chain identifier. sid
are more like short informative
names than unique object identifiers.
Please use the new
set of identifiers, especially for computer-based analyses, linking
to SCOP, and referring to
SCOP domains and classification in the
literature. If you do so, it works across releases automatically.
If you don't, any update or rearrangment of the classification
will result in wrong references, wrong links, and so on.
Rapid access format file (RAF)
A manually curated mapping
between
SEQRES and ATOM fields
for all PDB chains in SCOP
is available at the
associated ASTRAL
web site. This provides a SCOP user
with the official
definition of a domain in terms of:
a) SEQRES sequence; b) ATOM
sequence; c) mapping between the two sequences; and
d) the original PDB residue identifiers from the
ATOM field (these latter uniquely define the
domain in terms of PDB coordinates).
Given the nature of old and new
PDB files, it is basically
impossible to parse them or use any other derived data in a way that
uniquely identifies the set of residues belonging to a chain. The
RAF file is meant to overcome this difficulty
by providing a compact reference to the original PDB
file for
SCOP domains.
Although in some cases this precision
may seem unnecessary, in other cases it is indispensable because
individual residues
do matter, as in multiple sequence and structural alignments or
detailed analyses of protein structural features and functionally important
sites.
PDB chains are not necessarily biologically
meaningful units.
A SCOP domain corresponds to an evolutionary unit and
may include non-contiguous fragments from
the same or from different PDB chains. See
21953 from PDB
entry 1dan for example.
In these cases,
fragments are listed in the order in which they appear
in the original single chain precursor, independently of their
order in the PDB chain or the order of the chains
in the PDB file.
The order of the fragments is manually checked against
sequence databases, and the few cases in which the order
is presumed from a different source are reported in comments. Also
reported in comments are the cases of circular permutations, in which
the C and N terminal domains joined at a certain point, and then the
sequence was broken again, but in a different point.
The set of
SCOP
genetic domain
sequences
is available at the
ASTRAL
web site, together with several other sets of sequences
filtered according different criteria.
Old ASTRAL SCOP sequences,
in which fragments belonging to different chains are in separate
fasta entries, are also available at the same site.
Genetic domain coordinate files
The set of SCOP
domain
coordinate files in PDB-style format is available at the
ASTRAL web site.
As for sequences, the order in which fragments appear
in a domain coordinate set is not necessarily the same as the order in which they
appear in a PDB chain, but rather reflects the order in the
original single chain precursor, i.e., in the gene product.
We would like to encourage you to use these reference
data sets in your research work. This
would make comparison, linking, and integration of
SCOP-based results a trivial
task. The purpose is to develop a common language that we can use
without ambiguities when talking about a
SCOP domain and its
classification. The original PDB
file is the only shared basis among research groups worldwide, and by
explicitly referring to the PDB file via
the RAF file and the datasets derived from it
we hope to reduce the confusion.
Original work:
2) Brenner,S.E., Chothia,C., Hubbard,T.J.P. and Murzin,A. (1996)
Understanding protein structure: using SCOP for fold interpretation.
Methods Enzymol., 266, 635643.
[Medline]
3) Brenner,S.E. (1996) Molecular Propinquity,
PhD dissertation, Cambridge University, Cambridge, UK.
4) Lo Conte L., Brenner S. E., Hubbard T.J.P., Chothia C., Murzin A. (2002).
SCOP database in 2002: refinements
accommodate structural genomics.
Nucl. Acid Res.
30(1), 264-267.
[PDF]
6) Andreeva A., Howorth D., Brenner S.E., Hubbard T.J.P., Chothia C.,
Murzin A.G. (2004).
SCOP database in 2004:
refinements integrate structure and sequence family data.
Nucl.
Acid Res. 32:D226-D229.
[PDF]
or CHIME
,
and
perform searches, though the search itself is limited to the history content
and keywords include SCOP sunid,
PDB identifiers, and words that appear in web history
pages only.
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=47419
and:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=48328.
or:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sccs=a.1.1
or a list of pages, if the content of the PDB file
is split into more than one domain and different domains are
classified separately:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?pdb=1dan
and any word that appears in a web page:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?key=immunoglobulin
including special words like
newfa,
newsf,
newcf,
newcl.
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?key=newfa
and:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?key=newfa;ver=1.61.
Of course, you can obtain the same result by specifying the sunid
for that superfamily:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=47413
but the lev option is maintained for backward compatibility and
because it can be useful if you don't have the sunid at hand.
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?ec=4.2.1.30;lev=sf.
you will get the list of superfamilies for which there is at least
one SCOP domain with an E.C.
number equal to 4.2.1.30. Compare with the results of:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?ec=4.2.1.30.
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=47419
and:
http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi?sunid=48328.
icon
associated with each domain integrates the SCOP classification
for that domain with several other sources of related information. It can be
easily expanded, provided there is a map between a key in SCOP
(preferably the sunid) and the corresponding key(s) on the other side.
It is straightforward to use the same linking mechanism for other intermediate
levels in the SCOP hierarchy, if necessary.
d1dan.1 1dan T:,U:91-106 b.1.2.1 21953 cl=48724,cf=48725,sf=49265,fa=49266,dm=49267,sp=49268,px=21953
d1danu1 1dan U:107-210 b.1.2.1 21954 cl=48724,cf=48725,sf=49265,fa=49266,dm=49267,sp=49268,px=21954
d1danh_ 1dan H: b.47.1.2 26292 cl=48724,cf=50493,sf=50494,fa=50514,dm=50550,sp=50551,px=26292
d1danl1 1dan L:49-86 g.3.11.1 44209 cl=56992,cf=57015,sf=57196,fa=57197,dm=57201,sp=57202,px=44209
d1danl2 1dan L:87-142 g.3.11.1 44210 cl=56992,cf=57015,sf=57196,fa=57197,dm=57201,sp=57202,px=44210
d1danl3 1dan L:1-48 g.32.1.1 44965 cl=56992,cf=57629,sf=57630,fa=57631,dm=57632,sp=57633,px=44965
d1mbd__ 1mbd - a.1.1.2 15033 cl=46456,cf=46457,sf=46458,fa=46463,dm=46469,sp=46470,px=15033
d1lvk_2 1lvk 2-33,80-759 c.37.1.9 32173 cl=51349,cf=52539,sf=52540,fa=52641,dm=52642,sp=52645,px=32173
dir.cla.scop.txt replaces and extends dir.dom.scop.txt.
There is an entry for each SCOP domain.
The first column is the old scop identifier (sid),
the second column
is the PDB
identifier, and the third column is the domain definition,
specified in terms of PDB
chain and begin-end residue identifiers (including insertion code)
from the ATOM field in the corresponding PDB file.
48724 cl b - All beta proteins
48725 cf b.1 - Immunoglobulin-like beta-sandwich
49265 sf b.1.2 - Fibronectin type III
49266 fa b.1.2.1 - Fibronectin type III
49267 dm b.1.2.1 - Extracellular region of human tissue factor
49268 sp b.1.2.1 - Human (Homo sapiens)
21953 px b.1.2.1 d1dan.1 1dan T:,U:91-106
0 - 46456,48724,51349,53931,56572,56835,56992,57942,58117,58231,58788
21953 49268 -
49267 49266 49268,49269
80760 ! CASP5 ! structural genomics protein; complexed with fe, mse
1) Murzin,A., Brenner,S.E., Hubbard,T.J.P. and Chothia,C. (1995)
SCOP:
a Structural Classification of Proteins database for the
investigation of sequences and structures.
J. Mol. Biol. 247, 536-540.
[PDF]
Major additions to SCOP
(stable identifiers,
parseable files,
extended searching
and linking to and
from SCOP options,
reclassified entries history):
Major additions to SCOP
(sequences and coordinate files):
5) Chandonia,J.M., Walker,N.S., Lo Conte,L., Koehl,P.,
Levitt,M. and Brenner,S.E. (2002)
ASTRAL
compendium enhancements.
Nucleic Acids Res., 30, 260263.
[PDF]
Reclassification of SCOP entries carried out
in release 1.63 and 1.65:
Loredana Lo Conte (Last update 2004-04-07,2004-02-23)
Copyright © 1994-2009
The scop authors / scop@mrc-lmb.cam.ac.uk