Integrative structures of protein assemblies

The structures of many large protein complexes and assemblies are difficult to determine using a single experimental or computational method. Integrative structure determination fills this gap: several types of experimental data are combined with principles from physics, statistical inference, and prior models to obtain the structure. The input information may span multiple scales; for example, X-ray data are at the atomic scale, while FRET distances are at the domain scale. These sources can also be complementary; for example, EM maps may provide the shape of a complex, while chemical crosslinks may provide the orientation of binding interfaces. We have used structural, biochemical, biophysical, cell biological, genetic, and bioinformatics information to deduce the structures of assemblies.

Advantages

This approach has several advantages. First, the models are more complete than those generated by other methods, since proteins can be modelled at full length, including regions of unknown structure. Second, different types of information at different scales can be combined objectively via the Bayesian inference framework, which accounts for the uncertainty of each experiment without using arbitrary weights. Third, the approach produces all models consistent with the input information, allowing us to quantify the error bars (precision) of the structure. Finally, models generated by the integrative approach are validated by several methods, including experiments based on the models, providing a high level of confidence in the determined structure.
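As a minimal sketch of how a Bayesian score can combine data with different uncertainties without hand-tuned weights, the example below sums Gaussian negative log-likelihoods over distance restraints, each with its own noise level. Everything here is an illustrative assumption: the function names, the restraint dictionary layout, the Gaussian noise model, and the flat prior are hypothetical simplifications and not the scoring function used in our published models.

```python
import math


def gaussian_neg_log_likelihood(observed, predicted, sigma):
    """Negative log of a Gaussian likelihood N(observed | predicted, sigma)."""
    z = (observed - predicted) / sigma
    return 0.5 * z * z + math.log(sigma * math.sqrt(2.0 * math.pi))


def score(model, restraints):
    """Negative log posterior (up to a constant), assuming a flat prior.

    model: dict mapping a bead name to an (x, y, z) position in angstroms.
    restraints: list of dicts with keys bead_i, bead_j, observed, sigma.
    Each restraint is weighted only by its own uncertainty (sigma).
    """
    total = 0.0
    for r in restraints:
        predicted = math.dist(model[r["bead_i"]], model[r["bead_j"]])
        total += gaussian_neg_log_likelihood(r["observed"], predicted, r["sigma"])
    return total


# Hypothetical data: one precise distance and one noisy distance.
restraints = [
    {"bead_i": "A", "bead_j": "B", "observed": 10.0, "sigma": 2.0},
    {"bead_i": "A", "bead_j": "C", "observed": 20.0, "sigma": 15.0},
]

# model_a satisfies the precise restraint; model_b satisfies the noisy one.
model_a = {"A": (0.0, 0.0, 0.0), "B": (10.0, 0.0, 0.0), "C": (0.0, 40.0, 0.0)}
model_b = {"A": (0.0, 0.0, 0.0), "B": (16.0, 0.0, 0.0), "C": (0.0, 20.0, 0.0)}

score_a = score(model_a, restraints)
score_b = score(model_b, restraints)
print(f"model_a: {score_a:.2f}  model_b: {score_b:.2f}")
```

In this toy case the model that violates the precise restraint is penalised more heavily than the one that violates the noisy restraint, even though no explicit weights were set. Lower scores correspond to higher posterior probability.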

Differences from other approaches

In contrast to other approaches to structure determination, our focus is on determining the domain-level organisation of large assemblies from medium-resolution (~5-40 Å) experimental data. First, our models are usually coarse-grained, i.e., represented at lower than atomic resolution, so the emphasis is on the overall domain-level organisation rather than the finer atomic details. Second, the input information can be noisy, ambiguous, sparse, and incoherent (i.e., based on a heterogeneous sample). Therefore, more than one model can fit the data, and the integrative modeling approach produces an ensemble of models consistent with the data.
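To illustrate what a coarse-grained representation means in practice, the sketch below groups consecutive residues into beads placed at the centroid of their coordinates. The function name, the fixed residues-per-bead scheme, and the toy coordinates are assumptions made for illustration; actual modeling setups may use different bead sizes for different regions.

```python
def coarse_grain(residue_coords, residues_per_bead):
    """Group consecutive residues into beads placed at their centroid.

    residue_coords: list of (x, y, z) tuples, one per residue (e.g. C-alpha atoms).
    residues_per_bead: number of consecutive residues merged into one bead.
    Returns a list of dicts with the residue range (1-based) and bead centre.
    """
    if residues_per_bead < 1:
        raise ValueError("residues_per_bead must be at least 1")
    beads = []
    for start in range(0, len(residue_coords), residues_per_bead):
        chunk = residue_coords[start:start + residues_per_bead]
        n = len(chunk)
        centre = tuple(sum(c[k] for c in chunk) / n for k in range(3))
        beads.append(
            {"first_residue": start + 1, "last_residue": start + n, "centre": centre}
        )
    return beads


# Toy chain: 25 residues spaced 3.8 angstroms apart along the x axis.
chain = [(3.8 * i, 0.0, 0.0) for i in range(25)]
beads = coarse_grain(chain, residues_per_bead=10)
for bead in beads:
    print(bead)
```

With 10 residues per bead, the 25-residue toy chain is reduced to three beads, the last of which covers only the five remaining residues.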

The next generation of integrative modeling methods

Of late, AI-based methods have enabled amazing advances in structural biology, and it is an exciting, fast-paced field! For us, the simulation and analysis of large macromolecular assemblies open up interesting opportunities for developing new modeling methods. We develop rigorous methods and software to make integrative structure determination more accurate and efficient, improving upon approaches that are ad hoc, semi-automated, based on trial and error, or dependent on manual expert intervention. We use algorithms from computational statistics, statistical physics, machine learning, optimization, computer vision, and graph theory.

Areas of integrative structure determination

Opportunities exist in our group to work on several such assemblies. Example areas include assemblies involved in regulating gene expression, including chromatin remodelers, and mitochondrial assemblies. We collaborate closely with other cell and structural biologists to validate predictions from our models and generate data for the modeling.

The Nucleosome Remodeling and Deacetylase complex

We recently determined the structures of sub-complexes of the Nucleosome Remodeling and Deacetylase (NuRD) complex, a chromatin-modifying assembly that regulates gene expression and is conserved across plant and animal species. Using Bayesian integrative structure determination, we combined information from SEC-MALLS, DIA-MS, XLMS, negative stain EM, X-ray crystallography, NMR spectroscopy, secondary structure predictions, and homology predictions. The integrative structures were further validated by independent cryo-EM maps, biochemical assays, and known cancer-associated mutations. Based on the structures, we proposed a model showing how the two enzymatic modules in the assembly may be connected by MBD.

Figure: Two-state model of MBD3 binding in NuRD

Optimizing multi-scale coarse-grained model representations

We recently developed a method to optimize the multi-scale coarse-grained representation used for modeling assemblies in IMP. Nested sampling for Optimizing Representation (NestOR) uses Bayesian model selection to identify the optimal coarse-grained representation for a given integrative modeling setup. The nested sampling method is used to compute the model evidence for each representation efficiently.

Figure: Bayesian model selection-based optimal representation for the integrative models
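The idea of estimating a model evidence with nested sampling can be sketched on a toy problem. The example below is a generic, textbook-style nested sampling loop on a one-dimensional parameter with a uniform prior and a Gaussian likelihood; it is not the NestOR implementation, and the function names, the rejection-sampling replacement step, and the toy likelihood are all assumptions chosen for illustration. For this toy problem the evidence is very close to 1, which gives a simple check on the estimate.

```python
import math
import random


def nested_sampling_evidence(likelihood, sample_prior, n_live=200, n_iter=1000, seed=0):
    """Estimate the evidence Z = integral of likelihood(theta) * prior(theta).

    likelihood: function theta -> likelihood value.
    sample_prior: function rng -> one draw of theta from the prior.
    The prior volume is shrunk deterministically as X_i = exp(-i / n_live).
    """
    rng = random.Random(seed)
    live = [sample_prior(rng) for _ in range(n_live)]
    live_like = [likelihood(theta) for theta in live]

    evidence = 0.0
    x_prev = 1.0
    for i in range(1, n_iter + 1):
        # The lowest-likelihood live point defines the current shell.
        worst = min(range(n_live), key=live_like.__getitem__)
        l_worst = live_like[worst]
        x_curr = math.exp(-i / n_live)
        evidence += l_worst * (x_prev - x_curr)
        x_prev = x_curr

        # Replace it with a prior draw of higher likelihood. Rejection
        # sampling is simple but only practical for toy problems.
        while True:
            candidate = sample_prior(rng)
            l_candidate = likelihood(candidate)
            if l_candidate > l_worst:
                break
        live[worst] = candidate
        live_like[worst] = l_candidate

    # Contribution of the remaining live points.
    evidence += x_prev * sum(live_like) / n_live
    return evidence


def toy_likelihood(theta, centre=0.5, sigma=0.1):
    """Normalised Gaussian in theta; integrates to ~1 over [0, 1]."""
    z = (theta - centre) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def uniform_prior(rng):
    """Draw theta from a uniform prior on [0, 1)."""
    return rng.random()


z_estimate = nested_sampling_evidence(toy_likelihood, uniform_prior)
print(f"estimated evidence: {z_estimate:.3f}")
```

In a model selection setting, an evidence estimate of this kind would be computed for each candidate and the candidates compared on that basis. The estimate is stochastic, so repeated runs with different seeds give a sense of its uncertainty.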

Annotating precision for integrative models

PrISM is our recently developed method for identifying high- and low-precision regions in an ensemble of integrative models of large macromolecular assemblies. PrISM is now used in the pipeline for validating integrative models deposited in the wwPDB (worldwide Protein Data Bank), making us a part of the PDBDev Model Validation Group!

Figure: Precision for integrative structural models
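The notion of per-region precision in an ensemble can be illustrated with a much simpler stand-in than PrISM itself. The sketch below computes, for each bead, the root-mean-square deviation of its position from its ensemble mean, and labels beads against a fixed cutoff. This is not the PrISM algorithm: the function names, the assumption that models are already superposed, and the single-cutoff labelling are hypothetical simplifications.

```python
import math


def per_bead_spread(ensemble):
    """RMS deviation of each bead from its mean position across models.

    ensemble: list of models; each model is a list of (x, y, z) bead
    positions in the same bead order, assumed to be already superposed.
    """
    n_models = len(ensemble)
    n_beads = len(ensemble[0])
    spreads = []
    for b in range(n_beads):
        mean = tuple(sum(m[b][k] for m in ensemble) / n_models for k in range(3))
        msd = sum(math.dist(m[b], mean) ** 2 for m in ensemble) / n_models
        spreads.append(math.sqrt(msd))
    return spreads


def annotate_precision(spreads, cutoff):
    """Label a bead 'high' precision if its spread is within the cutoff."""
    return ["high" if s <= cutoff else "low" for s in spreads]


# Toy ensemble: bead 0 is fixed across models, bead 1 varies by 10 angstroms.
ensemble = [
    [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (30.0, 0.0, 0.0)],
]
spreads = per_bead_spread(ensemble)
labels = annotate_precision(spreads, cutoff=5.0)
print(list(zip(spreads, labels)))
```

A smaller spread means the bead is placed consistently across the ensemble, so the corresponding region of the model is better determined by the input information.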