psa-request Frequently Asked Questions

Table of contents

  1. psa-request Frequently Asked Questions
    1. Table of contents
    2. General questions
      1. How do I view plots on my . . . ?
      2. Why can't I use my AOL/Hotmail/Yahoo/... address?
      3. Web summary page Perl error
      4. Further analysis/search with PSA results?
    3. Analyzing long or short sequences
      1. How should I analyze my long sequence?
      2. What about very short sequences?
      3. Why are very short sequences predicted as mostly loops?
      4. Different results for different lengths?
    4. WD repeat questions
      1. Why no 3D structure prediction?
      2. Why does it find so many WD repeats?
      3. Why does it find this WD repeat?
      4. So which predicted repeats are real?


General questions

How do I view plots on my . . . ?

We can't answer this question for specific systems; we don't have the expertise. To answer, we would need to know something about your mail reading program and/or operating system, and there are just too many variations out there to keep up with them all. See the "Messages from the psa-request server" page for general information, and try the following recipe:
  1. Download the appropriate viewer for your operating system and desired plot format;
  2. In your mail reader, save the plot to a file; and
  3. Drag the plot file and drop it onto the viewer icon.
If this does not work, then you will need to consult your local system administrator.

Why can't I use my AOL/Hotmail/Yahoo/... address?

The psa-request server refused to process my request, saying something about a limit for non-academic organizations. But I am affiliated with a university [or other nonprofit research organization]. Could you please tell me how I can obtain access to your service?
The problem is that you need to use a university return address; the psa-request server cannot tell which "aol.com" (or "hotmail.com", or "yahoo.com", etc.) users are academic, and which are commercial. If your department or institute doesn't have computers with e-mail accounts, then you should be able to get an e-mail account from your university's information technology group; if you like, they can probably also help you set it up to forward to your free web e-mail account.

Failing that, we can set up a "commercial license" that provides short-term access for an arbitrary e-mail account for academic research. There is no fee for this service, but we do require a written request on the appropriate institutional letterhead that describes the research in one sentence (as if for a paper title), and specifies the email account, the name of the researcher who will be submitting the requests (normally the owner of the account), and the duration of the proposed research.

We apologize for the bureaucratic hassle, but the terms of the agreement that permits us to run this server do not allow us to provide unlimited access to non-academic users.

Web summary page Perl error

I am getting the following error when trying to look at the page that summarizes all of my results:
Error in Perl code: Undefined subroutine &psa_request_summary::make_summary_table called . . .
Am I doing something wrong?
No, this is an intermittent bug somewhere in the page and/or Web server. It seems to go away eventually if you click 'Reload' once or twice. To our great embarrassment, we must confess that we have been unable to eliminate this nuisance, partly because we don't have the resources to make an all-out assault on a bug that only happens occasionally. (If you happen to be a mod_perl wizard and think these symptoms sound familiar, please feel free to clue us in!)

Further analysis/search with PSA results?

. . . I was wondering whether any programs exist that can search for protein homologues using the secondary structure data derived from the PSA server as an input.
We are not aware of any such programs. However, the macroclass predicted by Type-1 analysis should suggest a starting point for browsing a structural classification database (e.g. SCOP, CATH, DALI, FSSP). Unfortunately, our present DSMs are designed to model structures rather generically, and therefore are not close enough to real structures to provide a starting point for more detailed modeling. In other words, it is not possible in general to provide the PDB ID of the "best" known structure corresponding to the DSM predicted for a given sequence. But that will change in the near future. We are presently working on a set of models constructed directly from PDB entries, so the secondary structure prediction will also imply an alignment to a specific tertiary structure. Unfortunately, we can't give an estimate of when this enhancement might become generally available; the best we can say is to keep an eye on the psa-request server home page.

Analyzing long or short sequences

How should I analyze my long sequence?

If your sequence is longer than the limit for the desired type of analysis (350 amino acids for Type-1 analysis and 1000 amino acids for all others), then you will need to break it into smaller subsequences in order to submit them. Since psa-request is geared toward domains, these subsequences should each correspond to a single structural domain insofar as this is possible; psa-request insists on this for Type-1 analysis. For help in chopping your sequence up into probable domains, you can try our profile library search tool; many of these profiles match entire domains.

The other possibility is to send overlapping 1000-mers for Type-2 analysis, and try to identify plausible domains based on the resulting secondary structure predictions. You should then resubmit the resulting domain candidates for Type-1 analysis, since Type-1 DSMs are sensitive to variations in sequence length (more on that below).
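The splitting step can be sketched as follows. This is only an illustration, not part of the server: the function name is hypothetical, and the 200-residue overlap is an assumed value, since we do not prescribe one. Choose an overlap large enough that a domain straddling a window boundary appears whole in at least one window.

```python
def overlapping_windows(sequence, window=1000, overlap=200):
    """Split a sequence into windows of at most `window` residues.

    Consecutive windows share `overlap` residues. Returns a list of
    (start, subsequence) pairs, where start is a 1-based position.
    The default overlap of 200 is an assumption, not a server rule.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start + 1, sequence[start:start + window]))
        if start + window >= len(sequence):
            break  # this window already reaches the end of the sequence
        start += step
    return chunks
```

Each returned start position tells you where the window sits in the original sequence, which makes it easier to map predicted secondary structure back onto full-length coordinates.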

In any case, if there are more than five such subsequences, we would appreciate it if you would refrain from sending them between 9 a.m. and 6 p.m. Eastern time, Monday through Friday, so that they do not delay results from other researchers. Thank you for your consideration.

What about very short sequences?

How reliable are the results using Type-2 analysis for sequences between 10 and 20 AA?

The minimum length limit for Type-1 analysis, 35 AA, was chosen because we don't have enough models that can generate sequences that short, which in turn is because there aren't any short single-domain examples in the PDB on which to base a model.

Type-2 generalizes this restriction to multidomain sequences, but it still uses statistics from single-domain proteins, since we don't have any 20 AA "domains." For that reason, results for short sequences should be taken with a big grain of salt. Even so, analyses of very short sequences may be quite reliable, or (more likely) they may be completely worthless; we have no way of testing them.

One reason such results might be less than reliable concerns the underlying data. The probability that an amino acid might appear in a given structural state is based on statistics gathered from single-domain structures exhibiting a range of sizes, but all of them at least 40 AA or so in length. Even assuming that the same set of structural states were applicable to short sequences, it is easy to believe that comparable statistics gathered from a set of very small structures might be significantly different.

Finally, before you send a very short sequence to the psa-request server, you would do well to ask yourself the following question: Do you really believe that such a short sequence has a well-defined secondary or tertiary structure at all?

Why are very short sequences predicted as mostly loops?

I have analysed several 10 to 20 AA sequences using Type-2 analysis, and I always get that the probability of being in a loop is the highest. I am unsure what "loop" means as a secondary structure, and what this implies for my peptides.

A loop is defined as anything that is neither helix nor turn nor strand; you'll notice that all four probabilities for any given residue always add up to 1.

These secondary structure probabilities reflect the likelihood that a given amino acid could have a particular secondary structure state, according to the generic model. They are computed by summing over all possible assignments of consecutive residues in the sequence to consecutive states in the model, weighted by the probability that the model could generate that assignment. The states for strands and helices always occur in series in the models, with some alternate paths to allow for length variation. The probability for an amino acid to be assigned to one of these states is therefore influenced by the assignment of its neighbors. Below a certain threshold, a short sequence of "helix-like" residues would have lower helix probabilities than a longer sequence that is otherwise similar. Since there are minimum strand and helix lengths, a sufficiently short sequence won't find any strands or helices in the model short enough to fit, so the server will predict zero strand and helix probability for all residues.

So that should explain why Type-2 analysis believes you have a loop; it doesn't have enough plausible alternative ways to explain a sequence that short.
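The effect of a minimum helix length can be seen in a toy calculation. The sketch below is not the server's algorithm: it uses only two states (loop and helix), weights every allowed assignment equally, and assumes a hypothetical minimum helix length of 4. It simply counts, for each position, the fraction of allowed assignments that place that position in a helix.

```python
from itertools import product

def helix_fraction(n, min_helix):
    """For each position of a length-n sequence, return the fraction of
    allowed labelings (L = loop, H = helix) that put that position in a helix.

    A labeling is allowed only if every run of consecutive H's is at least
    `min_helix` long. All allowed labelings are weighted equally, which a
    real model would not do. Enumeration is exponential, so keep n small.
    """
    def allowed(labels):
        runs = "".join(labels).split("L")
        return all(len(run) == 0 or len(run) >= min_helix for run in runs)

    labelings = [lab for lab in product("LH", repeat=n) if allowed(lab)]
    return [sum(lab[i] == "H" for lab in labelings) / len(labelings)
            for i in range(n)]

for n in (3, 4, 5):
    print(n, helix_fraction(n, min_helix=4))
```

With these assumptions, a 3-residue sequence gets zero helix fraction everywhere because no helix fits, and residues near the ends of a 5-residue sequence get lower values than those in the middle because fewer allowed helices cover them.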

Different results for different lengths?

. . . Depending on the window of a large protein that I am studying, the structure would be predicted strongly as a beta propeller or as a mixed alpha helical/beta structure. The program was much more decisive about small segments of the protein (~300aa) than with large pieces (1000aa) . . .
Except for Type-2 DSMs, which are quite generic, each DSM is designed to cover a specific range of sequence lengths, in order to control the composition and length distributions of component secondary structures. In particular, DSMs are not constructed for lengths that do not make sense for the structure. For example, the DSMs in the WD4 macroclass cover a domain length range of 187 to 279 amino acids, reflecting constraints on internal loop length. (External loops are handled with leader and trailer models, so longer sequences can be considered as WD4 candidates. The same is not true for Type-1 models, some of which have strict length limits.)

The probable reason for being more "decisive" about smaller fragments is the underlying assumption that each whole sequence folds into a single domain in its entirety (water-soluble for Type-1).

So the length-dependence of DSMs in general is an expected property of the system. In fact, if you were to add (or subtract) a tail with arbitrary secondary structure to a sequence you had previously analyzed, you should worry if the result doesn't change. Put another way, your confidence in the PSA results should be a function of your confidence that the sequence you submitted comprises a domain, the whole domain, and nothing but the domain.

WD repeat questions

Why no 3D structure prediction?

Why didn't I get a 3D structure prediction for my WD repeat sequence of 11 repeats?

The PSA Sequence Analysis web server cannot always produce a 3D structure prediction for the beta propeller region of a WD-repeat sequence. This is because we only constructed models for 4 through 10 repeats -- we doubt that an eleven-bladed beta propeller is even possible. Of course, nature will probably prove us wrong, eventually. Until it does, and we have a structure to look at, we hesitate to attempt to construct an eleven-bladed beta propeller model.

However, there is another possible interpretation for a WD-repeat structure with a high number of predicted repeats. It is possible that the eleven repeats are for two smaller beta propeller domains, perhaps one of five repeats and one of six. Of course, it is difficult to be certain where to break the sequence in half -- and if it could be two propellers, why not three? If the repeats cluster into two groups separated by a long (i.e. domain-sized) loop, then the hypothesis of two WD-repeat domains separated by an intermediate domain seems more likely. Although the intermediate domain could have been inserted into a loop of a single WD-repeat domain, the long loop would be the likeliest place to divide the sequence. (And you'd still have to believe in eleven-bladed beta propellers.)

One (relatively) simple test for multiple WD-repeat domains would be to split the sequence in half at an appropriate point, and use the dynamic programming local alignment algorithm of your choice to align the sequences. If you find significant similarity, then the likeliest hypothesis is a genetically recent duplication of a single WD-repeat domain. Unfortunately, lack of self-similarity can't rule out other evolutionary scenarios, such as concatenating two evolutionarily very distant WD-repeat domains.
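A minimal sketch of this test, using the Smith-Waterman local alignment algorithm, is given below. The scoring scheme (identity match/mismatch with a linear gap penalty) and the function names are assumptions chosen to keep the example short; a real analysis would use a substitution matrix such as BLOSUM62 with affine gap penalties, and would judge significance by comparing against scores for shuffled sequences.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b,
    using simple identity scoring and a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def split_and_align(sequence, split_at):
    """Split `sequence` after residue `split_at` (1-based) and locally
    align the two halves against each other."""
    return smith_waterman_score(sequence[:split_at], sequence[split_at:])
```

A natural choice for `split_at` is a position inside the long loop that separates the two groups of repeats.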

Why does it find so many WD repeats?

I notice that the yeast WD repeat sequence YPL183C is listed on the "YPL183C aligned repeats" page as having eight aligned repeats. However, when I submit this particular sequence for WD-repeat analysis, I get eleven predicted repeats. Why are so many repeats predicted . . . ?
The psa-request server's WD repeat prediction program is known to overpredict; that is, it errs in the direction of sensitivity to marginal repeat sequences at the cost of specificity. So it often finds more repeats in a sequence than human experts would. And the alignments shown on the WD repeat Web pages were constructed manually by experts, namely Dr. Chrysanthe Gaitatzes and Dr. Eva Neer. This means that if the server says that a given sequence is not a WD repeat, then you can trust its answer. Unfortunately, if it says that it is a WD repeat, you are faced with the problem of determining which repeats are "real." More on that below.

[also need to say something about inconsistency between most likely model and number of reported repeats. -- rgr, 21-Dec-99.]

Why does it find this WD repeat?

Sometimes, WD-repeat analysis finds potential repeats that do not match the consensus sequence published in [1]. This is because the server uses probabilities computed from the profile found on the WD repeat home page, and not the consensus sequence. For instance, suppose psa-request reports that the protein subsequence "QGSGAI" matches the first portion of the first repeat. This snippet doesn't match the consensus pattern at all:
GHxxxV
QGSGAI

But here are the probabilities assigned to the individual amino acids, given the alignment:

Q      G      S      G      A      I
0.0207 0.0193 0.1186 0.1715 0.1186 0.2443

Each value is looked up in the profile according to the residue and the profile position to which it is aligned. Note that the two G residues have different probabilities as a result of being aligned to different positions.

These probabilities are then multiplied together to get the overall probability of 2.4e-07. This should be compared to the "best possible" sequence GHTGSV with a score of 8.6e-04, the "worst possible" score for WWWWMD (among others) of 1.1e-18, and the median score of 5.7e-11 for sequence HQAHKF (among others). On a logarithmic scale, the score for QGSGAI is slightly better than halfway between the best score and the median score, which psa-request may well find plausible in the right context.
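The arithmetic can be checked with a few lines of code. This sketch uses only the numbers quoted in this answer; the helper name `log_position` is ours and is not part of the server.

```python
import math

# Per-position probabilities for Q G S G A I, as listed above.
probs = [0.0207, 0.0193, 0.1186, 0.1715, 0.1186, 0.2443]
score = math.prod(probs)

# Reference scores quoted in the text.
best, median = 8.6e-04, 5.7e-11

def log_position(value, low, high):
    """Where `value` falls between `low` (0.0) and `high` (1.0) on a log scale."""
    return (math.log10(value) - math.log10(low)) / (math.log10(high) - math.log10(low))

position = log_position(score, median, best)
print(f"score = {score:.1e}")
print(f"position between median and best = {position:.2f}")
```

A position above 0.5 means the score lies closer to the best score than to the median on the logarithmic scale.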

So which predicted repeats are real?

As explained above, the psa-request server's WD repeat prediction program is known to overpredict. Therefore, if homology leads you to believe that one or more repeats predicted by the server are chimerical, then you should probably trust the homology data over the server.

The next question is which repeat(s) to throw out in the case of overprediction. If a homolog of known structure exists on the WD repeat home page, and the homology is good, then finding the answer should be straightforward.

If that is not the case, then the following suggestions from Dr. Lihua Yu, author of the WD repeat predictor, may help.

In terms of how to interpret the smoothing results and how to make judgement of the predicted repeats, I used the following rules when I looked at the smoothing results myself:
  1. Completeness. Profile 1 has 12 positions and profile 2 has 15 positions. If all of those positions have high probabilities, the repeat tends to be real.

  2. Clear start and end positions. If the probabilities of the positions after the putative start and end positions of the repeat drop quickly, the repeat tends to be real too. This is especially true for the end position since the WD profile is a stronger signal than the GH start profile.

  3. Knowledge of the regular expression is a great help for making the final decision.

  4. Number of repeats. For proteins predicted with a higher number of repeats, there is almost no doubt that it is a WD-repeat protein, though the number of repeats needs a careful examination of the profile probabilities. However, for proteins predicted with only 4 or 5 repeats, membership in the WD-repeat family can only be assured after confirming the predicted repeats.
To this I would add:
  1. Consistent spacing. If there is an outlying "repeat" separated from the main group by a large loop, it is probably chimerical.


References

[1] Neer EJ, Schmidt CJ, Nambudripad R & Smith TF: "The ancient regulatory-protein family of WD-repeat proteins," Nature 371, 297-300 (1994) PMID: 8090199


Please direct your questions and comments about these Web pages and the PSA e-mail server to:

Bob Rogers <rogers@darwin.bu.edu>
BioMolecular Engineering Research Center
Boston University, Boston Massachusetts
Last modified: Mon Mar 12 13:41:10 EST 2001