{{software info devel
|description=package for working with profile hidden Markov models (HMM)
|license=free
|fields=bioinformatics
}}
[http://hmmer.janelia.org {{PAGENAME}}] is a software {{#show: {{PAGENAME}} |?description}} of known regions in proteins.


== Availability ==

{{list resources for software devel}}

== General info ==

An HMM is a statistical model that describes the known sequence variations within a specific group of proteins that may be of special interest; for example a protein family with known function, or a domain containing a well-studied interaction surface or an active site. HMMs are a machine-learning technique [http://en.wikipedia.org/wiki/Hidden_Markov_model] where the models are built from training examples that are known good members, and where the finished models can be used to reliably classify and annotate new or poorly understood protein sequences in an automated fashion. Large libraries of trusted HMMs (such as [http://pfam.sanger.ac.uk/ Pfam]) are of course immensely beneficial, as they can be used to automatically classify large portions of newly sequenced genomes, directly as they become available.

The HMMER package contains applications for working with HMMs, for example for the tasks below (typical commands are sketched after the list):

* Building and calibrating HMMs.
* Matching an HMM against a sequence database (for finding new members).
* Matching a sequence against an HMM database (for finding new sequence features).
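
As a rough illustration of these tasks, a minimal HMMER-3 session might look like the following sketch (file names are placeholders; HMMER-2.3.2 counterparts are noted in the comments):

<pre>
# Build an HMM from a trusted multiple alignment
# (HMMER-2.3.2 additionally requires hmmcalibrate on the resulting HMM)
hmmbuild globins.hmm globins.sto

# Match the HMM against a sequence database to find new members
hmmsearch globins.hmm uniprot.fasta > globins.hits

# Match a sequence against an HMM database to find sequence features
# (hmmpress indexes the HMM database once; the HMMER-2.3.2 equivalent of hmmscan is hmmpfam)
hmmpress Pfam-A.hmm
hmmscan Pfam-A.hmm myprotein.fasta > myprotein.domains
</pre>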

== Versions ==

There are two versions of HMMER that can conceivably be useful:

* HMMER-2.3.2: Old stable version.
* HMMER-3.0: Fast, but backwards incompatible and non-feature-complete.

Their implementations and output (and potentially also the actual results) are vastly different, so ongoing projects are not recommended to switch between them. For new projects, it is highly recommended to spend some time deducing which version is the most suitable.

HMMER-3 may seem like an obvious choice; it is much faster than its predecessor, it is currently used in large-scale production (e.g. by [http://pfam.sanger.ac.uk/ Pfam]), and it is promoted as the official main HMMER version. However, HMMER-3.0 is not feature complete. In particular, the old default alignment behaviour (glocal, hmm_ls) is missing, so if this feature is necessary, choose HMMER-2.3.2.

== Computational considerations ==

=== Work locally ===

Many of the features in HMMER require access to database flatfiles, and standard practice when running on a compute cluster is to copy all necessary files to a node-local directory before any work is done with them. This behaviour is highly encouraged on most resources, since multiple simultaneous accesses to the same large files on a shared disk are likely to cause problems for all computations currently running on the resource, and not only for the owner of the badly behaving jobs. For this reason, most SNIC resources have amenities in place to aid you in running your HMMER jobs in an optimal manner (for example <code>prepare_db</code> and <code>$HMMER_DB_DIR</code>, described for example [https://extras.csc.fi/mgrid/hmmer_re/hmmer232/index.html here]).
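
As a minimal sketch (assuming the node-local scratch area is reachable as <code>$SNIC_TMP</code>, as it is on many SNIC systems, and using placeholder file names), a job script could stage the database before searching:

<pre>
# Copy the HMM database (including any index files) to node-local disk,
# then run the search against the local copy
cp /proj/mydata/Pfam-A.hmm* $SNIC_TMP/
hmmscan $SNIC_TMP/Pfam-A.hmm proteins.fasta > proteins.out
</pre>

Where <code>prepare_db</code> and <code>$HMMER_DB_DIR</code> are provided, they are typically the preferred way to handle this staging.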
 
 
  
 
=== Do not run out of memory ===

If possible, you should ensure that you have enough RAM to hold the database as well as the results and still have some headroom. This ensures that HMMER will not need to read data from disk unnecessarily, which otherwise would cause significant slowdown. This is less important with HMMER-2.3.2, since the HMM-sequence alignment implementation is so CPU intensive that memory and disk considerations are less likely to have an impact on runtime. Nevertheless, ensuring that the database files remain cached can be done for example by:
  
* '''Choose a system with enough RAM''' <br/> Multiprocessor systems generally have more memory than single-processor systems, and the database will also require proportionally less memory, since only one copy is needed in the OS file cache regardless of the number of processors using it.
* '''Partition the search space''' <br/> For huge databases or very restricted amounts of available memory it may be necessary to split the database into manageable chunks and process them as separate jobs (see the sketch below).
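
A minimal sketch of the partitioning approach (chunk size, file names and the batch script <code>hmmer_chunk.sh</code> are placeholders; use the submit command of your resource's batch system):

<pre>
# Split a FASTA database into chunks of 100000 sequences each
awk '/^>/ { if (n % 100000 == 0) f = sprintf("chunk_%03d.fasta", n / 100000); n++ } { print > f }' uniprot.fasta

# Submit one job per chunk (hmmer_chunk.sh runs hmmsearch on the chunk given as its argument)
for f in chunk_*.fasta; do
    sbatch hmmer_chunk.sh "$f"
done
</pre>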
 
 
  
 
=== Use your processors wisely ===

Users should normally not have to worry about this, since the HMMER default behaviour is to run on all the CPU cores it can detect on the compute node, which is nearly always what you want. However, should you need to control this for any reason, it can be adjusted with the <code>--ncpus</code> command line option or the <code>$HMMER_NCPUS</code> environment variable, which should already be set to the correct value if you are using a preinstalled HMMER version on a SNIC resource.
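
For example, using the mechanisms described above (the values are only illustrative):

<pre>
# Limit HMMER to four cores for the whole job script ...
export HMMER_NCPUS=4
# ... or for a single invocation
hmmsearch --ncpus 4 globins.hmm uniprot.fasta > globins.hits
</pre>
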
== Compilation ==

Benchmarks at [http://www.nsc.liu.se NSC] have shown that Intel compilers seem to produce more efficient HMMER binaries than does gcc.

HMMER-2 is also extremely sensitive to hardware vectorisation optimisations, even to the extent that -xSSE4.2 produces significantly faster binaries than does -xSSE4.1, so it makes sense to set this as aggressively as possible. NSC uses icc 11.1.069 and <code>CFLAGS="-O3 -ip -xSSE4.2"</code> for the SNIC systems [http://www.nsc.liu.se/systems/snic/ Kappa] and [http://www.nsc.liu.se/systems/snic/ Matter], and <code>CFLAGS="-O3 -ip -xSSSE3"</code> for [http://www.nsc.liu.se/systems/snic/ Neolith].
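
A minimal sketch of such a build (compiler and flags as above; module names and installation paths vary between resources):

<pre>
# Build HMMER-2.3.2 with icc and aggressive vectorisation (flags as used on Kappa/Matter)
tar xzf hmmer-2.3.2.tar.gz
cd hmmer-2.3.2
env CC=icc CFLAGS="-O3 -ip -xSSE4.2" ./configure
make
make check
</pre>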

The official binary distribution of HMMER-3.0 is compiled with icc and is fully adequate for threaded operation. However, some of the HMMER-3 binaries (e.g. hmmscan) have been shown to perform upwards of 50% better with MPI parallelisation, so this may be a reason to recompile locally. It may also be profitable to select the MPI library carefully, as intelmpi in its standard configuration can give runtime fluctuations of up to ~100% with hmmsearch, while openmpi seemingly performs optimally and consistently. For HMMER-3, [http://www.nsc.liu.se NSC] uses icc 11.1.069 and openmpi 1.4.1, with the compiler flags automatically detected by the configure script.
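
A minimal sketch of an MPI-enabled build and run (the <code>--enable-mpi</code> configure option and the <code>--mpi</code> runtime flag are described in the HMMER-3 user guide; rank counts and file names are placeholders):

<pre>
# Build HMMER-3.0 with icc and MPI support (openmpi loaded via the resource's module system)
env CC=icc ./configure --enable-mpi
make
make check

# Run an MPI-parallel search across 16 ranks
mpirun -np 16 hmmsearch --mpi globins.hmm uniprot.fasta > globins.hits
</pre>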

== License ==

{{show license}}


== Experts ==

{{list experts}}


== Links ==

* [http://hmmer.janelia.org/ Official website]
* [ftp://selab.janelia.org/pub/software/hmmer3/3.0/Userguide.pdf HMMER-3 documentation]
* [http://selab.janelia.org/software/hmmer/2.3.2/hmmer-2.3.2.tar.gz HMMER-2.3.2 release] (contains pdf documentation)
