Thursday, October 29, 2015

Tips on how to run Linux in a virtual machine

Linux has been prevalent inside the enterprise data center for many years. LAMP services, Web servers, proxies, firewalls and load balancers are just a few of the many use cases for which Linux provides the base operating system. With ease of use and documentation steadily improving, many Linux distributions have seen a rise in use over the past decade or so. Somewhere along this timeline of increased usage, we also introduced virtualization into our data centers. And with that, there are a few things to keep in mind when running Linux in a virtual machine.

Logical volume management

Logical volume management (LVM) is a technology included in many recent Linux distributions that allows administrators to perform a number of disk and partition management tasks. Some of the striping features -- extending or striping data across multiple disks -- may be less relevant in the virtualization world, as users are typically storing data on the same storage area network or datastore. That said, LVM provides other interesting functionality as well. By enabling LVM, administrators gain the ability to perform online file system extensions -- growing different partitions and file systems on the fly while keeping the file system online and accessible. Also, depending on strict compliance requirements, LVM allows us to perform volume-based snapshots for backup and recovery purposes without invoking the vSphere-based functionality.
My advice is to partition your VM with LVM if you have a strict availability policy on your workload and take advantage of the online resizing functionality. If you do not require a high amount of uptime or do not plan on running separate partitions within your Linux installation, then leave LVM disabled, as the complexity will far outweigh the benefits.
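As a concrete illustration of that online-resize workflow, here is a minimal sketch that assumes the grown virtual disk shows up as /dev/sdb, the whole disk is an LVM physical volume, and the logical volume /dev/vg_data/lv_data carries an ext4 file system (all of these names are examples; substitute your own):
# Rescan the disk after growing it in vSphere so the guest sees the new size
echo 1 > /sys/class/block/sdb/device/rescan
# Grow the physical volume, then the logical volume, then the file system -- all online
pvresize /dev/sdb
lvextend -l +100%FREE /dev/vg_data/lv_data
resize2fs /dev/vg_data/lv_data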

Partition options

A default installation of Linux usually prompts the user to simply use one partition for all the files. This is OK in some instances, but as you try to tweak and improve your VMs' security and performance, having separate partitions for items such as /tmp, /var, /home and /usr can start to make sense -- especially if you want each partition to have different mount options. Mount options are applied to partitions by specifying them on the corresponding line within the /etc/fstab file, as shown below:
UUID=0aef28b9-3d11-4ab4-a0d4-d53d7b4d3aa4 /tmp ext4     defaults,noexec 1 2
If we take the example of a Web server, one of the most common use cases for Linux in a virtual machine, we can quickly see how some of the "default" mounting options end up hurting our security and performance initiatives.
Noatime/atime/relatime: These are mount options that dictate how the timestamps on the files contained within the partition are handled. On older Linux distributions, 'atime' was the default, meaning the OS would write a timestamp to a file's metadata every time the file was read (yes, simply a read invokes this). In terms of a Web server that is serving up files to the world all day long, you can imagine the overhead of this process. By specifying 'noatime' on the partition housing your Web server's data, you relieve the server of this overhead by not updating access times at all. The default option on newer distributions is 'relatime,' which brings the best of both worlds and only updates the access time if the modification time happens to be newer.
Noexec/exec: This disables or enables the execution of binary files on a given partition. In terms of our Web server example, it makes perfect sense to mount a /tmp partition with 'noexec'. In fact, many hardening guides suggest using this option to improve security.
Users should use caution when changing access time parameters. Some applications, such as mail-related functionality, will require a full 'atime' mount option. In the Web server example, as long as access and security guidelines allow it, mount your Web server data with 'noatime.' In terms of noexec, use this option wisely, as many automated installers and packages may extract to /tmp and execute from there. This is something that can easily be turned on and off, but at the very least, I'd specify 'noexec' for /tmp.
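A hedged sketch of what those recommendations might look like in practice is shown below; the /var/www path stands in for wherever your Web server data actually lives, and the UUIDs are placeholders for your own partitions:
# /etc/fstab entries (UUIDs are placeholders)
UUID=<tmp-partition-uuid>   /tmp       ext4   defaults,noexec    1 2
UUID=<www-partition-uuid>   /var/www   ext4   defaults,noatime   1 2
# Apply the new options to already-mounted partitions without a reboot
mount -o remount,noexec /tmp
mount -o remount,noatime /var/www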

VMXNET3 and PVSCSI

Utilizing the VMXNET3 network adapter and paravirtualized disk adapter has been the recommendation for quite some time now inside of a VM. On a Windows-based VM, we can simply specify these and the drivers are installed automatically within VMware tools. Linux presents a few challenges around utilizing this hardware. First, new versions of Linux distributions quite often come with their own driver for the VMXNET3 adapter and will use those drivers by default, even if VMware tools are installed.
Older Linux distributions may contain an outdated version of the VMXNET3 driver, which might not give you the full feature set included in the VMware tools version. VMware's KB2020567 outlines how to enable certain features within the VMXNET driver. If you want to install the VMXNET3 driver that ships with VMware tools, you can specify the following option during the VMware tools install:
./vmware-install.pl --clobber-kernel-modules=vmxnet3
The paravirtualized SCSI adapter is a great way to gain a little extra throughput at a lower CPU cost and is usually recommended as the adapter of choice. Make sure to check the supported OS list before making this choice to ensure your kernel or distribution is supported by the paravirtual SCSI adapter.
I advise admins to use VMXNET3 and PVSCSI, if possible. If you're using an older kernel, install the VMware Tools VMXNET3 version. If you're using a newer kernel, use the native Linux driver inside your distribution.
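To see which driver a VM is actually using, a quick check along the following lines can help (the interface name eth0 is just an example):
# Show the driver and driver version bound to the network interface
ethtool -i eth0
# Show the vmxnet3 module version available to the running kernel
modinfo vmxnet3 | grep -i version
# Confirm the paravirtual SCSI module is loaded
lsmod | grep vmw_pvscsi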

Memory management

The Linux OS is constantly moving memory pages from physical memory into its local swap partition. This is by design and, in fact, VMware does the same thing with its memory management functionality. But Linux memory management behaves somewhat differently and will move memory pages even if physical memory -- which is now virtual memory -- is available. In order to cut down on the swap activity inside a Linux VM, we can adjust the 'swappiness' value. A higher value makes the kernel swap more aggressively, while a lower value means memory will be left alone more often. To adjust this value, simply add the line "vm.swappiness=##" inside /etc/sysctl.conf and reboot -- replacing ## with your desired value.
I prefer to change this value to a lower number than the default of 60. There is no point in having both the OS and vSphere managing your memory swap. Again, it really depends on the application, but I normally set this to a value of between 15 and 20.
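A minimal sketch of checking and changing the value -- using 15 as the target, per the recommendation above -- might look like this:
# Check the current value (the usual default is 60)
cat /proc/sys/vm/swappiness
# Apply the new value immediately, no reboot required
sysctl vm.swappiness=15
# Make it persistent across reboots
echo "vm.swappiness=15" >> /etc/sysctl.conf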

I/O scheduler

Just as ESXi does a great job with memory management, it also has its own special sauce when it comes to scheduling I/O and writes to disk. Again, the Linux OS duplicates some of this functionality within itself. As of kernel 2.6, most distributions have been using Completely Fair Queuing (CFQ) as the default I/O scheduler. The others available are NOOP, Anticipatory and Deadline. VMware explains how to change this value and why you might want to, as there is no sense in scheduling I/O twice. In short, the default I/O scheduler used within your Linux kernel can be switched by appending an elevator switch to your grub kernel entry.
There's no need to schedule within the OS, and then schedule once more in the hypervisor. I suggest using the NOOP I/O scheduler, as it does nothing to optimize the disk I/O, allowing vSphere to manage all of it.
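As a rough sketch -- the device name sda and the grub file locations are examples and vary by distribution -- the switch can be made like this:
# See the available schedulers for a disk; the active one is shown in brackets
cat /sys/block/sda/queue/scheduler
# Switch to noop at runtime (does not survive a reboot)
echo noop > /sys/block/sda/queue/scheduler
# To make it persistent, append elevator=noop to the kernel line, e.g. in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"
# then regenerate the grub configuration (update-grub on Debian/Ubuntu, grub2-mkconfig on RHEL)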

Remove unused hardware and disable unnecessary services

How many times have you used that virtual floppy disk inside of your VMs in the past year? How about that internal PC speaker? If you don't plan on using these devices, then the simple answer is to blacklist them from loading within the Linux kernel. The commands to remove a floppy are as follows:
echo "blacklist floppy" | tee /etc/modprobe.d/blacklist-floppy.conf
rmmod floppy
update-initramfs -u
Also, there is no need to stop at unused hardware. While you're at it, you might as well disable any virtual consoles that you probably aren't using. This can be done by commenting out tty lines within /etc/inittab, as shown below.
1:2345:respawn:/sbin/getty 38400 tty1
2:23:respawn:/sbin/getty 38400 tty2
#3:23:respawn:/sbin/getty 38400 tty3
#4:23:respawn:/sbin/getty 38400 tty4
#5:23:respawn:/sbin/getty 38400 tty5
#6:23:respawn:/sbin/getty 38400 tty6
I suggest that you get rid of the floppy disk. Keep in mind that you will also have to remove the hardware from the VM's configuration, as well as disable it within the VM's BIOS. Some other modules that you can safely blacklist include mptctl (monitors RAID configuration), pcspkr, snd_pcm, snd_page_alloc, snd_timer, snd and snd_soundcore (all related to sound), coretemp (monitors the temperature of CPUs), and parport and parport_pc (parallel ports).
As always, be sure you aren't actually utilizing any of these services before blacklisting them. Also, I always leave a couple of virtual consoles enabled in the event that I might use them, but six is a bit overkill.
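If you do decide to blacklist those modules, a minimal sketch -- assuming a Debian/Ubuntu-style initramfs; module names vary by distribution, so check lsmod first -- follows the same pattern used for the floppy above:
# Blacklist the PC speaker and sound modules in one file
cat >> /etc/modprobe.d/blacklist-virtual-guest.conf <<'EOF'
blacklist pcspkr
blacklist snd_pcm
blacklist snd_timer
blacklist snd
EOF
update-initramfs -u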
These are just a few items to keep in mind when running Linux virtualized in a VMware environment. In terms of performance gains, the age-old saying of "it depends" applies to each and every one of them. You may see a noticeable performance increase from some of these tweaks, and degradation from others. As always, it's good practice to test any of these changes in a lab environment before implementing them in production. Technology is constantly changing, and with it, so are the best practices. So, if you have any other tips or tricks, leave them in the comments below.

Monday, October 26, 2015

StartSSL: Free CA certificates

The StartSSL™ Free (Class 1) digital certificates are provided by StartCom without charge. They provide modest assurances and are meant to secure personal web sites, public forums or web mail. Verification is done automatically and instantly by electronic means, mostly without the interference or involvement of StartCom personnel. StartSSL™ Free supports:
  • Web server certificates (SSL/TLS)
  • Client and mail certificates (S/MIME)
  • 128/256-bit encryption
  • US $ 10,000 insurance guaranteed
  • Valid 1 year (365 days)
https://www.startssl.com

DOE Seeks to Mend HPC Talent Gap

Get a group of HPC stakeholders in a room and it won’t be long before they are bemoaning the talent shortage, the gap between the demand for a well-trained HPC workforce and the number of qualified candidates available to fill these positions. Despite the attention paid to this topic, the HPC talent gap remains a thorn in the side of a field that is increasingly understood to be synonymous with a nation’s leadership potential. And while the issue crosses industry and government boundaries, the Department of Energy (DOE) has additional reason to be concerned due to the significance of its mission, which includes such sensitive areas as cyber- and nuclear security, not to mention being a testbed for innovation and discovery.
Given the importance of scientific computation to the federally-funded DOE centers, the DOE’s Office of Science set out to explore the issue further, by charging the Advanced Scientific Computing Advisory Committee (ASCAC) with identifying the causes of the shortage and the solutions considered most likely to reverse the trend. At SC14 in New Orleans, the ASCAC subcommittee Chairperson Barbara Chapman of the University of Houston revealed key findings and recommendations, further detailed in this 26-page report.
The study’s authors collected data from national laboratories, university computing programs and previous reports on workforce preparedness, noting several relevant trends that shed light on different dimensions of this challenge. The initial finding confirmed the prevailing suspicions that indeed all DOE national laboratories face workforce recruitment and retention challenges in computing sciences fields relevant to labs’ missions. The situation could become even more problematic as large numbers of DOE employees are expected to retire in the coming decade.
The major contributor to the talent shortage is the lack of computing science graduates, with US citizens, females and minorities being especially underrepresented. For example, foreign nationals currently account for more than half of the graduate students in Ph.D.-granting computer science programs. This leaves labs seeking candidates from an international pool, which has the effect of extending already-long lag times between when a position is posted and when it can be filled. The report notes that it takes 100 days to fill a DOE job versus 48 to 50 days to fill a similar position in industry. When US citizenship is a requirement, as is the case at the DOE National Nuclear Security Administration (NNSA) labs, it can take upwards of 200 days to fill a position.
The subcommittee further reported a lack of diversity in the talent pool. The percentage of women graduating with computing degrees is just 17.2 percent for computer science and 18 percent for all computing doctorates. Hispanic and African-American students comprise less than 4 percent of computing doctorate recipients.
Another factor according to the study is an uneven distribution in specialties with “hot” topics like artificial intelligence and robotics being favored at the expense of a solid HPC foundation in algorithms, applied mathematics, data analysis and visualization, and high-performance computing systems. These skills are cross-disciplinary, requiring a mix of computing, math and science skills. Although well-designed computational science degrees and specializations are popping up at institutions across the country, these are still ad-hoc programs and not prevalent enough at this juncture to significantly amend the shortage.
One of the most effective tools that addresses these primary root causes, both the lack of interest in computing degrees and in the core HPC subject matter, is outreach and recruitment.
The subcommittee pointed to several already-established, DOE-facilitated programs as being critical to bolstering the computing sciences workforce. The list of successful outreach efforts includes a five-year program from the Nuclear Science and Security Consortium partnership. Focused on training a generation of nuclear scientists, the program has reached more than 100 students since it was established. The DOE Computational Science Graduate Fellowship (DOE CSGF) is another that is paying dividends, rated highly effective in multiple reviews, according to Chapman. The fellowship trains students in interdisciplinary knowledge and provides DOE lab experience. The subcommittee recommends expanding the DOE CSGF program and using it as a model for new fellowship programs in areas pertinent to DOE lab needs, such as exascale algorithms and extreme computing.
Given the shortfalls of existing academic programs in meeting the needs of current and future methodologies, such as exascale computing, the subcommittee recommends the establishment of a DOE-supported computing leadership graduate curriculum advisory group to publish curricular competency guidelines at the graduate and undergraduate levels, with the aim of influencing curriculum development efforts.
Other recommendations focused on the importance of boosting the DOE’s visibility on university and college campuses as well as the need to work with other agencies to “pro-actively recruit, mentor and increase the involvement of significantly more women, minorities, people with disabilities, and other underrepresented populations into active participation in CS&E careers.”
The subcommittee notes that the interesting career opportunities offered by the DOE national laboratories will be a natural draw with increased awareness, yet elements of lab culture could be made more appealing to today’s mobile generation. In this regard, the report recommends uniform measures across the DOE laboratories to facilitate incentives like ongoing relocation assistance, lifetime professional development, and a sabbatical program. In order to implement such strategies, the DOE would need to examine the laboratory funding model and its relationship to recruiting and retention.
“This is an issue for the whole supercomputing community,” says Chapman in a DOE feature article on the report. “Meeting the mission-critical workforce needs of the national laboratories will require leadership to address this lack of diversity and to design outreach programs to attract a more diverse student population.”

Tuesday, October 13, 2015

Platypus: A Haplotype-Based Variant Caller For Next Generation Sequence Data

Reference: Andy Rimmer, Hang Phan, Iain Mathieson, Zamin Iqbal, Stephen R. F. Twigg, WGS500 Consortium, Andrew O. M. Wilkie, Gil McVean, Gerton Lunter. Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications. Nature Genetics (2014) doi:10.1038/ng.3036
Quick links:
  • Download the latest stable version of Platypus
  • User forum and update notifications (Google group)
  • Examples of how to run Platypus in different settings
  • Frequently Asked Questions
  • Full documentation
Description
Platypus is a tool designed for efficient and accurate variant detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data, it has been run on very large datasets as part of the 1000 Genomes and WGS500 projects, and it is being used in clinical sequencing trials in the Mainstreaming Cancer Genetics programme. Platypus has been thoroughly tested on data mapped with Stampy and BWA. It has not been tested with other mappers, but it should behave well. Platypus has been used to detect variants in human, mouse, rat and chimpanzee samples, amongst others, and it should perform well on data from any diploid organism. It has also been used to find somatic mutations in cancer, and mosaic mutations in human exome data.
Capabilities
Platypus reads data from BAM files and outputs a single VCF file containing a list of identified variants, along with genotype calls and likelihoods for all samples. It can identify SNPs, MNPs and short (less than one read length) indels, as well as larger variants (deletions of up to several kb and insertions of perhaps 200bp) using local assembly. Platypus can process large amounts of BAM data very efficiently and can handle samples spread across multiple BAM files. Duplicate read marking, local re-alignment, and variant identification and filtering are performed on the fly using a single command. Platypus will run on any input data in BAM format, but has only been properly tested on Illumina data.
Dependencies
Platypus is written in Python, Cython and C. It requires only Python (>=2.6) and a C compiler to build; these are standard on most Linux distributions and on Mac OS, so Platypus should build and run without problems for most people.
Building Platypus
To build Platypus, simply un-pack the tar-ball and run the buildPlatypus.sh script provided:

tar -xvzf Platypus_x.x.x.tgz
cd Platypus_x.x.x
./buildPlatypus.sh
This will take a minute or so, and generate quite a lot of warnings. If the build is successful, you will see a message, 'Finished building Platypus'. Platypus is then ready for variant-calling.
Running Platypus
Platypus can be run from the command-line, using Python. It needs 1 or more BAM input files, and a FASTA reference file. The BAM file(s) must be indexed using Samtools or an equivalent program, and the FASTA file must also be indexed using 'samtools faidx' or equivalent.
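Assuming samtools is installed and your reference and BAM file are named ref.fa and input.bam (the names here are only examples), the indexing steps look like this:
samtools faidx ref.fa
samtools index input.bam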
The simplest way to run Platypus is as follows:
python Platypus.py callVariants --bamFiles=input.bam --refFile=ref.fa --output=VariantCalls.vcf
The output will be a single VCF file containing all the variants that Platypus identified, and a 'log.txt' file, containing log information. The last line in the log file, and on the command-line output, should be 'Finished variant calling'. This means that the calling has completed without major errors. It is a good idea to also check the log output for warnings or errors.
Contact:  Bug reports, comments, and feature requests (positive feedback also greatly appreciated) can be sent to platypus-users@googlegroups.com

Monday, October 12, 2015

Dell to Buy EMC for $67B; Sharpen Focus on Large Enterprises and High-End Computing

Dell will buy the storage provider EMC (NYSE: EMC) in a deal worth about $67 billion, reports The New York Times today (Dell Announces Purchase of EMC for $67 Billion). Dell and the investment firm Silver Lake, its financial backer, are betting that a huge acquisition will help one of the best-known names in the industry keep up with a rapidly changing technology landscape.
“Under the terms of the deal, Dell will pay $33.15 per EMC share, which includes cash plus tracking stock linked to part of EMC’s economic interest in VMware, a publicly traded business. The Dell founder and chief executive, Michael S. Dell, will lead the combined company as chairman and chief executive.
“For the last two years, since taking the company private, Mr. Dell and Silver Lake have been trying to help it adapt to a changed tech landscape. Buying EMC brings Dell one of the biggest names in computer data storage, adding to existing offerings like network servers, corporate software and mobile devices,” according to the NYT report.
Addison Snell, CEO of analyst firm Intersect360 Research said, “With the number-one market share company for HPC servers purchasing the number-one market share company for HPC storage, of course this deal has ramification for HPC users. For Dell it is a chance to drive an end-to-end high-performance data strategy, which has been an area they have sought to improve. For EMC, it is a chance to consolidate separate lines into a cohesive high-performance strategy — right now, Isilon and XtremIO are isolated businesses. And with virtualization now on the rise in HPC environments, Dell will want to see if VMware can be a valuable component.”
The Wall Street Journal reports the deal “cements Dell’s transition from a consumer-oriented company to one focused on technology for large companies (Dell to Buy EMC for $67 Billion)…[in the] biggest technology-industry takeover ever.” Formal announcement of the deal follows widespread reports last week of the imminent merger.
Talking about the change, EMC CEO Joseph Tucci is quoted in a Boston Globe report: “I’m tremendously proud of everything we’ve built at EMC – from humble beginnings as a Boston-based startup to a global, world-class technology company with an unyielding dedication to our customers…But the waves of change we now see in our industry are unprecedented and, to navigate this change, we must create a new company for a new era.’’
There is fascinating banter going on at CrowdChat offering a swath of opinion ranging from IBM will win big time, at least short term, to growing competition from others in the ‘VM’ space, to praise for expanded offerings from Dell. It’s worth checking out: https://www.crowdchat.net/post/363771.
One analyst told EnterpriseTech, HPCwire’s sister publication, that the deal is likely to bring out competitors in full force. “Near term, I would be surprised if some of the Dell and EMC competitors were not out in full force with FUD and MUD trying to disrupt and create barriers to either Dell or EMC closing near-term business, both the Dell and EMC sales teams need to get in front of their customers, partners and prospects to remove those FUD and MUD barriers to avoid becoming part of a revenue prevention department,” said Greg Schulz, Sr. Advisory Analyst, Server and StorageIO Group.
On the plus side, Schulz said, “For Dell high-end computing customers, this opens up some interesting scenarios as they will now have access to technologies such as EMC Isilon for data lakes using HDFS and NAS, or XtremIO for block-based fast flash SSD storage clusters, or the new all-flash DSSD for rack-scale shared PCIe SSD, among others. Keep in mind that EMC acquired DG back in the late 90s, which had servers and storage (e.g., the CLARiiON that became today’s VNX, which Dell used to OEM). EMC still has strong server DNA as they use a lot of them across their products.”

Friday, October 9, 2015

Stepping Up to the Life Science Storage System Challenge

Storage and data management have become perhaps the most challenging computational bottlenecks in life sciences (LS) research. The volume and diversity of data generated by modern life science lab instruments and the varying requirements of analysis applications make creating effective solutions far from trivial. What’s more, where LS was once adequately served by conventional cluster technology, HPC is now becoming important -- one estimate is that 25% of bench scientists will require HPC resources in 2015.
Currently, the emphasis is on sequence data analysis although imaging data is quickly joining the fray. Sometimes the sequence data is generated in one place and largely kept there – think of major biomedical research and sequencing centers such as the Broad Institute and Wellcome Trust Sanger Institute. Other times, the data is generated by thousands of far-flung researchers whose results must be pooled to optimize LS community benefit – think of the Cancer Genome Hub (CGHub) at UC Santa Cruz, which now holds about 2.3 petabytes of data, all contributed from researchers spread worldwide.
Given the twin imperatives of collaboration and faster analysis turnaround times, optimizing storage system performance is a high priority. Complicating the effort is the fact that genomics analysis workflows are themselves complicated and each step can be IO or CPU intensive and involve repetitively reading and writing many large files to and from disk. Beyond the need to scale storage capacity to support what can be petabytes of data in a single laboratory or organization, there is usually a need for a high-performance distributed file system to take advantage of today’s high core density, multi-server compute clusters.
Broadly speaking, accelerating genomics analysis pipelines can be tricky. CPU and memory issues are typically easier to resolve. Disk throughput is often the most difficult variable to tweak and researchers report it’s not always clear which combination of disk technology and distributed file system (NFS, GlusterFS, Lustre, PanFS, etc.) will produce the best results. IO is especially problematic.
Alignment and de-duplication, for example, is usually a multi-step, disk-intensive process: perform the alignment and write a BAM file to disk, sort that BAM file to disk, then deduplicate the BAM file to disk (a minimal command-line sketch of such a pipeline appears after the list below). Researchers are using a full arsenal of approaches -- powerful hardware, parallelization, algorithm refinement, storage system optimization -- to accelerate throughput. Simply put, storage infrastructures must address two general areas:
  • The infrastructure must be capable of handling the volume of data being generated by today’s lab equipment. Modern DNA sequencers already produce up to a few hundred TB per instrument per year, a rate that is expected to grow 100-fold as capacities increase and more annotation data is captured. With many genomics workflows, many terabytes of data must routinely be moved from the DNA sequencing machines that generate the data to the computational component that performs the DNA alignment, assembly, and subsequent genomic analysis.
  • During the analysis process, multiple tools are used on the data in its various forms. Each of the tools has different IO characteristics. For example, in a typical workflow, the output data from a sequencer might be prepared in some way, partitioning it into smaller working packages for an initial analysis. This type of operation involves many reads/writes and is IO bound. The output from that step might then be used to perform a read alignment. This operation is CPU-bound. Finally, the work done on the smaller packages of data needs to be sorted and merged, aggregating the results into one file. This process requires many temporary file writes and is IO bound.
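As promised above, here is a minimal sketch of the alignment/sort/deduplication steps, illustrating why such workflows read and write so much data; the tool choice (bwa, samtools, Picard), file names and exact option syntax are illustrative and vary by version:
# Align paired-end reads and write an (unsorted) BAM to disk
bwa mem ref.fa sample_R1.fastq.gz sample_R2.fastq.gz | samtools view -b -o sample.unsorted.bam -
# Sort the BAM -- another full read/write pass over the data
samtools sort -o sample.sorted.bam sample.unsorted.bam
# Mark duplicates -- yet another pass, producing a third copy on disk
java -jar picard.jar MarkDuplicates INPUT=sample.sorted.bam OUTPUT=sample.dedup.bam METRICS_FILE=dedup_metrics.txt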
On the compute side, there are a variety of solutions available to help meet the raw processing demands of today’s genomics analysis workflows. Organizations can select high performance servers with multi-core, multi-threaded processors; systems with large amounts of shared memory; analytics nodes with lots of high-speed flash; systems that make use of in-memory processing; and servers that take advantage of co-processors or other forms of hardware-based acceleration.
On the storage side, the choices can be more limiting. Life sciences code, as noted earlier, tends to be IO bound. There are large numbers of rapid read/write calls. The throughput demands per core can easily exceed practical IO limitations. Given the size and number of files being moved in genomic analysis workflows, traditional NAS storage solutions and NFS-based file systems frequently don’t scale out adequately and slow performance. High performance parallel file systems such as Lustre and the General Parallel File System (GPFS) are often needed.
Determining the right file system for use isn’t always straightforward. One example of this challenge is a project undertaken by a major biomedical research organization seeking to conduct whole genome sequencing (WGS) analysis on 1,500 subjects; that translated into 110 terabytes (TB) of data with each whole genome sample accounting for about 75GB. Samples were processed in batches of 75 to optimize throughput, requiring about 5TB of data to be read and written to disk multiple times during the 96 hour processing workflow, with intermediate files adding another 5TB to the I/O load.
Many of the processing steps were IO intensive and involved reading and writing large 100GB BAM files to and from disk; these did not scale well. Several strategies were tried (e.g., upgrading the network bandwidth, minimizing IO operations, improving workload splitting). Despite the I/O improvements, significant bottlenecks remained in running disk-intensive processes at scale. Specifically, the post-alignment processing slowed down on NFS shared file systems due to a high number of concurrent file writes. In this instance, switching to Lustre delivered a threefold improvement in write performance.
Conversely, Purdue chose GPFS during an upgrade of its cyberinfrastructure, which serves a large community of very diverse domains.
“We have researchers pulling in data from instruments to a scratch file and this may be the sole repository of their data for several months while they are analyzing it, cleaning the data, and haven’t yet put it into archives,” said Mike Shuey, research infrastructure architect at Purdue. “We are taking advantage of a couple of GPFS RAS (reliability, availability, and serviceability) features, specifically data replication and snapshot capabilities to protect against site-wide failure and to protect against accidental data deletion. While Lustre is great for other workloads – and we use it in some places – it doesn’t have those sorts of features right now,” said Shuey.
LS processing requirements -- a major portion of Purdue’s research activity -- can be problematic in a mixed-use environment. Shuey noted LS workflows often have millions of tiny files whose IO access requirements can interfere with the more typical IO stream of simulation applications; larger files in a mechanical engineering simulation, for example, can be slowed by accesses to these millions of tiny files from a life sciences workflow. Purdue deployed DataDirect Networks acceleration technology to help cope with this issue.
Two relatively new technologies that continue to gain traction are Hadoop and iRODS.
Hadoop, of course, uses a distributed file system and framework (MapReduce) to break large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is that it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance.
It also turns out that the Hadoop architecture is a good choice for many life sciences applications. This is largely because so much of life sciences data is semi- or unstructured, file-based data, ideally suited for ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g., a Linux cluster) keeps cost down, and little or no hardware modification is required. Still, issues remain, say some observers.
“[W]hile genome scientists have adopted the concept of MapReduce for parallelizing IO, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally…Efforts exist for adapting existing genomics data structures to Hadoop, but these don’t support the full range of analytic requirements,” noted Alan Day (principal data scientist) and Sungwook Yoon (data scientist) at vendor MapR, in a blog post during the Strata & Hadoop World conference, held earlier this month.
MapR’s approach is to implement an end-to-end analysis pipeline based on GATK and running on Hadoop. “The benefit of combining GATK and Hadoop is two-fold. First, Hadoop provides a more cost-effective solution than a traditional HPC+SAN substrate. Second, Hadoop applications are much easier for software engineers to design and scale,” they wrote, adding the MapR solution follows Hadoop and the GATK best practices. They argue results can be generated on easily available hardware and users can expect immediate ROI by moving existing GATK use cases to Hadoop.
iRODS solves a different challenge. It is a data grid technology that essentially puts a unified namespace on data files, regardless of where those files are physically located. You may have files in four or five different storage systems, but to the user it appears as one directory tree. iRODS also allows setting enforcement rules on any access to the data or submission of data. For example, if someone entered data into the system, that might trigger a rule to replicate the data to another system and compress it at the same time. Access protection rules based on metadata about a file can be set.[1]
At the Renaissance Computing Institute (RENCI) of the University of North Carolina, iRODS has been used in several aspects of its genomics analysis pipeline. When analytical pipelines are processing the data they also register that data into iRODS, according to Charles Schmitt, director of informatics, RENCI[iv]. At the end of the pipeline, the data exists on disks and is registered into iRODS. Anyone wanting to use the data must come in through iRODS to get the data; this allows RENCI to set policies on access and data use.
Broad benefits cited by the iRODS consortium include:
  • iRODS enables data discovery using a metadata catalog that describes every file, every directory, and every storage resource in the data grid.
  • iRODS automates data workflows, with a rule engine that permits any action to be initiated by any trigger on any server or client in the grid.
  • iRODS enables secure collaboration, so users only need to log in to their home grid to access data hosted on a remote grid.
  • iRODS implements data virtualization, allowing access to distributed storage assets under a unified namespace, and freeing organizations from getting locked in to single-vendor storage solutions.
It’s worth noting that RENCI has been an important participant in the iRODS consortium, whose members include, for example, Seagate (NASDAQ: STX), DDN, Novartis (NYSE: NVS), IBM (NYSE: IBM), Wellcome Trust Sanger Institute, and EMC (NYSE: EMC).

[1] RENCI white paper, Life sciences at RENCI: Big Data IT to manage, decipher, and inform, http://www.emc.com/collateral/white-papers/h11692-life-sciences-renci-big-data-manage-info-wp.pdf

The Revolution in the Lab is Overwhelming IT

Sifting through the vast treasure trove of data spilling from modern life science instruments is perhaps the defining challenge for biomedical research today. NIH, for example, generates about 1.5PB of data a month, and that excludes NIH-funded external research. Not only have DNA sequencers become extraordinarily powerful, but they have also proliferated in size and type, from big workhorse instruments like the Illumina HiSeq X Ten, down to reliable bench-top models (MiSeq) suitable for small labs, with USB-stick-sized devices that plug into a USB port now in advanced development.
“The flood of sequence data, human and non-human that may impact human health, is certainly growing and in need of being integrated, mined, and understood. Further, there are emerging technologies in imaging and high resolution structure studies that will be generating a huge amount of data that will need to be analyzed, integrated, and understood,”[i] said Jack Collins, Director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research, NCI.
Here are just a few of the many feeder streams to the data deluge:
  • DNA Sequencers. An Illumina (NASDAQ: ILMN) top-of-the-line HiSeq X Ten can sequence a full human genome in just 18 hours (generating 3TB of data) and deliver 18,000 genomes in a year. File size for a single whole genome sample may exceed 75GB.
  • Live cell imaging. High throughput imaging in which robots screen hundreds of millions of compounds on live cells typically generate tens of terabytes weekly.
  • Confocal imaging. Scanning hundreds of tissue sections, sometimes with many scans per section, each with 20-40 layers and multiple fluorescent channels, can produce on the order of 10TB weekly.
  • Structural Data. Advanced investigation into form and structure is driving huge and diverse datasets derived from many sources.
Broadly, the flood of data from various LS instruments stresses virtually every part of most research computing environments (CPU, network, storage, system and application software). Indeed, important research and clinical work can be delayed or not attempted because, although generating the data is feasible, the time required to perform the data analysis can be impractical. Faced with these situations, research organizations are forced to retool their IT infrastructure.
“Bench science is changing month to month while IT infrastructure is refreshed every 2-7 years. Right now IT is not part of the conversation [with life scientists] and running to catch up,” noted Ari Berman, GM of Government Services, the BioTeam consulting firm and a member of Tabor EnterpriseHPC Conference Advisory Board.
The sheer volume of data is only one aspect of the problem. Diversity in files and data types further complicates efforts to build the “right” infrastructure. Berman noted in a recent presentation that life sciences generates massive text files, massive binary files, large directories (many millions of files), large files of ~600GB, and very many small files of ~30KB or less. Workflows likewise vary. Sequencing alignment and variant calling offer one set of challenges; pathway simulation presents another; creating 3D models -- perhaps of the brain -- and using them to perform detailed neurosurgery with real-time analytic feedback presents yet another.
“Data piles up faster than it ever has before. In fact, a single new sequencer can typically generate terabytes of data a day. And as a result, an organization or lab with multiple sequencers is capable of producing petabytes of data in a year. The data from the sequencers must be analyzed and visualized using third-party tools. And then it must be managed over time,” said Berman.
An excellent, though admittedly high-end, example of the growing complexity of computational tools being contemplated and developed in life science research is presented by the European Union Human Brain Project[ii] (HBP). Among its lofty goals are creation of six information and communications technology (ICT) platforms intended to enable “large-scale collaboration and data sharing, reconstruction of the brain at different biological scales, federated analysis of clinical data to map diseases of the brain, and development of brain-inspired computing systems.”
The elements of the planned HPC platform include[iii]:
  • Neuroinformatics: a data repository, including brain atlases.
  • Brain Simulation: building ICT models and simulations of brains and brain components.
  • Medical Informatics: bringing together information on brain diseases.
  • Neuromorphic Computing: ICT that mimics the functioning of the brain.
  • Neurorobotics: testing brain models and simulations in virtual environments.
  • HPC Infrastructure: hardware and software to support the other Platforms.
(Tellingly HBP organizers have recognized the limited computational expertise of many biomedical researchers and also plan to develop technical support and training programs for users of the platforms.)
There is broad agreement in the life sciences research community that there is no single best HPC infrastructure to handle the many LS use cases. The best approach is to build for the dominant use cases. Even here, said Berman, building HPC environments for LS is risky, “The challenge is to design systems today that can support unknown research requirements over many years.” And of course, this all must be accomplished in a cost-constrained environment.
“Some lab instruments know how to submit jobs to clusters. You need heterogeneous systems. Homogeneous clusters don’t work well in life sciences because of the varying use cases. Newer clusters are kind of a mix and match of things; we have fat nodes with tons of CPUs and thin nodes with really fast CPUs, [for example],” said Berman.
Just one genome, depending upon the type of sequencing and the coverage, can generate 100GB of data to manage. Capturing, analyzing, storing, and presenting the accumulating data requires a hybrid HPC infrastructure that blends traditional cluster computing with emerging tools such as iRODS (Integrated Rule-Oriented Data System) and Hadoop. Unsurprisingly, the HPC infrastructure is always a work in progress.
Here’s a snapshot of two of the most common genomic analysis pipelines:
  1. DNA Sequencing. DNA extracted from tissue samples is run through the high-throughput NGS instruments. These modern sequencers generate hundreds of millions of short DNA sequences for each sample, which must then be ‘assembled’ into proper order to determine the genome. Researchers use parallelized computational workflows to assemble the genome and perform quality control on the reassembly—fixing errors in the reassembly.
  2. Variant Calling. DNA variations (SNPs, haplotypes, indels, etc) for an individual are detected, often using large patient populations to help resolve ambiguities in the individual’s sequence data. Data may be organized into a hybrid solution that uses a relational database to store canonical variations, high-performance file systems to hold data, and a Hadoop-based approach for specialized data-intensive analysis. Links to public and private databases help researchers identify the impact of variations including, for example, whether variants have known associations with clinically relevant conditions.
The point is that life science research – and soon healthcare delivery – has been transformed by productivity leaps in the lab that now are creating immense computational challenges. (next Part 2: Storage Strategies)
[i] Presented on a panel at Leverage Big Data conference, March 2015; http://www.leveragebigdata.com;
[ii] https://www.humanbrainproject.eu/
[iii] https://www.humanbrainproject.eu/discover/the-project/platforms;jsessionid=emae995mioyqxt99x2a14ljg

The Best of HPC in 2014

As we turn the page to 2015, we’re taking a look back at the top stories from 2014 to reflect on just how far the fastest machines in the world (and the people who run them) have come within only 12 months. From the blending of big data and HPC to hardware breakthroughs, 2014 points to big things in store for 2015 and beyond, whether we’re talking about HPC for enterprise or the path to exascale.
Along the way we’ve seen leaps toward achieving functional quantum computers, advancements in accelerators, steps toward building more energy-efficient systems, as well as big news from major players such as IBM, Intel and NVIDIA. And from basic science to healthcare, we’ve seen supercomputers take a leading role in research efforts that have saved lives and granted us a deeper understanding of the world around us.
Below we’ve selected the top stories from the HPCwire archive that highlight the course of the industry throughout 2014 as well as paint an ever-sharper image of what’s in store for the coming year.

The Future of Accelerator Programming

Thanks to their performance, energy efficiency, and affordability, accelerators have taken on a leading role in PCs, mobile devices and supercomputers alike. And as the technology became more ubiquitous, programs had to be written or retooled to take advantage of the heterogeneous architecture. But accelerator programming is not without its difficulties, as contributing writers Kamil Rocki and Martin Burtscher show us in this feature article that helped to kick off 2014.
Read more…

IBM’s New Deal Captures Refocused HPC Strategy

With the sale of its x86 business to Lenovo, 2014 marked a year of questions about what would be in store for IBM and its HPC customers. We sat down with Dave Turek, IBM’s Vice President of Advanced Computing, to find out more about Big Blue’s grand plan for HPC’s future.
Read more…

How HPC is Hacking Hadoop

Although 2014 started out with only a few instances where Hadoop and HPC were beginning to overlap, the trend of big data in supercomputing became one of the year’s prevailing themes. Join us as we take a closer look at the research that is making this merger possible: Hadoop for data-intensive science, adapting MapReduce to an HPC environment, exploring it across different parallel file systems, handling scheduling and more.
Read more…

Intel Etches HPC Niche with Xeon E7 V2

When Intel pulled back the curtain on the processors that would follow the Westmere line, the new Xeon E7 v2 series looked primed for enterprise big data and analytics. But with its 22 nanometer process, memory and AVX instruction capabilities, as well as its focus on reliability, the new processor looks to be better suited to HPC than one would first expect.
Read more…

Details Emerging on Japan’s Future Exascale System

Work on the successor to Japan’s K Computer—a $1.38 billion exascale system slated for installation in 2019 and production in 2020—is now underway.
According to the roadmap put forth by Yoshio Kawaguchi from Japan’s Office for Promotion of Computing Science/MEXT, basic development for the future system is proceeding on the software, accelerator, processor and scientific project planning fronts.
Read more…

Inside Major League Baseball’s “Hypothesis Machine”

With over 140 years’ worth of box scores, batting averages and RBIs, data has always been an essential part of baseball culture. But as that data goes digital, leading MLB managers and scouts are investing in analytics and even supercomputing to help them gain a competitive edge.
Read more…

Big Science; Tiny Microservers: IBM Research Pushes 64-Bit Possibilities

When a friend dropped a Sheeva Plug into the hands of Ronald Luijten, a system designer at IBM Research in Zurich, neither could have imagined the development cycles that moment would set in motion.
Requiring remarkably low wattage, the Sheeva Plug was a key ingredient for IBM and SKA/ASTRON to begin a journey away from power-hungry systems and into the realm of microservers.
Read more…

What Power8 and OpenPOWER Might Mean for HPC

In the aftermath of its deal to sell off its x86 servers to Lenovo, IBM has set a new heading toward hybridization as it looks to join its POWER8 processors with accelerators and high-speed networking, as well as opening up its chip and system software through the OpenPOWER Foundation. While Big Blue has placed an emphasis on hyperscale Web applications and data analytics, we’re taking a closer look at how the new technology will undoubtedly fit right into the realm of traditional HPC as well.
Read more…

Quantum Processor Hits 99.9 Percent Reliability Target

In April of this past year, physicists at UC Santa Barbara leapt toward the goal of a fully functional quantum computer in an effort that could be key to breaking through Moore’s Law. While this step forward is admittedly small scale–featuring superconducting qubits–the advancement could eventually enable large-scale, fault-tolerant quantum circuits.
Read more…

Detailed Results from 2014’s TOP500 Fastest Supercomputers List

In 2014, it came as no surprise that the Chinese Tianhe-2 system, which blew all others out of the water when it was announced the year before, easily defended its peak spot on the TOP500 list. But perhaps even more notably, with the exception of a 3.14-petaflop Cray XC30 at an undisclosed U.S. government site, there were no shakeups in the list’s top ten at all.
What’s making news instead is an emerging trend that translates as lost list share for the United States. Total US contributions to the TOP500 fell from 265 to 233, while the number of Chinese systems continues to grow.
Read more…

First Details Emerge from Cray on Trinity Supercomputer

In one of the largest awards in the company’s history, Cray has signed a $174 million deal with the National Nuclear Security Administration (NNSA) for a next-generation Cray XC machine called Trinity. Aimed at handling the agency’s nuclear stockpile by simulating maintenance, degradation and even destruction of reserves, the system will reportedly be equipped with an 82-petabyte capacity Cray Sonexion storage system, and is proposed to be capable of 30 petaflops.
Read more…

Is This the Exascale Breakthrough We’ve Been Waiting for?

Last year, UK-based startup Optalysys posted a video that made waves, detailing a prototype optical processor that it claims can lap its electronic counterparts while slashing cost and power consumption. Not only that, but the company says it is on track to deliver exascale levels of processing power on a standard-sized desktop computer within the next few years.
Read more…

Jellyfish Use Novel Search Strategy

As we push to give computer intelligence human qualities, researchers are finding evidence of computing-like behavior in the animal kingdom. According to a paper published in August 2014, a UK scientist has discovered that certain species of jellyfish find food by using a supercomputing algorithm called “fast simulated annealing.”
Read more…

Will 2015 Be the Year of the FPGA?

FPGAs, while still outshined by the flops offered by GPUs and Xeon Phi, are poised to branch out into wider markets and gain traction for those focused on the “three Ps” -- thanks to their increasingly attractive price, performance, and programmability. And as big data finds a home in HPC, FPGAs are taking root in search-centric areas, whether it’s simple web search or search and discovery applications in bioinformatics, security, image recognition and beyond.
Read more…

Intel ‘Haswell’ Xeon E5s Aimed Squarely at HPC

As Intel rolled out its “Haswell” Xeon E5-2600 v3 processors for servers and workstations, it became clear that the chipmaker plans to keep the line firmly rooted in HPC, as evidenced by new features that are primed to deliver a performance boost for modeling and simulation tools.
Read more…

CORAL Signals New Dawn for Exascale Ambitions

For those who thought the architecture for next-wave supercomputers was set, the upcoming CORAL systems – pre-exascale contracts from the DOE worth $325 million – are bound to shake up expectations.
What’s particularly unique about the systems (beyond the scope of the DOE’s investment) is that they could change how national labs approach energy consumption, prioritization of scientific and security challenges, and perhaps even the US’s current position as a leading supercomputing power.
Read more…

Supercomputing Wrap: Top Stories from SC14

For anyone present at SC14, it should come as no surprise that the CORAL systems were the top announcements at SC14, followed closely by IBM and the future of Power systems for HPC. According to IDC analysts, IBM and Lenovo’s reemergence as dominant players in the supercomputing industry could soon mean big changes for the TOP500 list as both companies look poised to topple HP from its top spot.
Read more…

Dell in Talks to Purchase EMC

The New York Times is reporting that Dell Inc. is in advanced talks to buy EMC (NYSE: EMC). According to the report (Dell Is Said to Be in Talks to Acquire EMC), the deal would likely involve mostly cash and the agreement could be reached within a week, although the paper’s sources cautioned that talks were continuing and might still fall apart.
Here’s an excerpt from the New York Times article:
“A takeover of EMC would answer years of questions about the fate of the 36-year-old company, which has grown from data storage into a collection of businesses, including network security and content management for corporate clients. Critics have contended that EMC’s vaunted “federation” business model has produced little benefit for shareholders in recent years, with some investors calling for a breakup. Such demands have been largely rebuffed by EMC’s longtime chairman and chief executive, Joseph M. Tucci, who is 67.”
“EMC’s core storage business, meanwhile, has faced pressure from rivals, including the likes of Pure Storage, a six-year-old company that went public on Wednesday. Much of EMC’s value, moreover, comes from its 80 percent ownership of VMware, which makes so-called virtualization software, used to make efficient modern computing systems. VMware is worth $34.3 billion, making EMC’s stake worth $27.4 billion. EMC also has 98 percent of the voting shares in VMware.”
Read the full New York Times article: http://www.nytimes.com/2015/10/08/business/dealbook/dell-is-said-to-be-in-talks-to-acquire-emc.html?_r=0

Tuesday, October 6, 2015

Filter Bam File Using PySam


Question:
I want to produce a BAM or SAM file that only contains reads with a specific type of mismatch (for example, where the base is T on the reference genome and C on the read). Can someone help me please?

Answer:
 
The MD tag of an aligned read contains a string encoding all mismatches along with their position in the read. You could scan the MD tags for T's and then check if the read sequence has a C at that position.
To do this in Python, install pysam (http://code.google.com/p/pysam/) and start with:
 
 
 
import pysam

# Mismatch of interest: reference base is T, read base is C
nuclRef = "T"
nuclRead = "C"

# Open the BAM file (it must be indexed, e.g. with 'samtools index your.bam')
bam = pysam.Samfile("your.bam")

for read in bam.fetch():
    if read.is_unmapped:
        continue
    # The MD tag encodes the mismatched reference bases and their offsets
    md = read.opt("MD")
    # Crude first pass: the MD string mentions a mismatched reference T.
    # To be precise, parse the MD tag (together with the CIGAR string) to get
    # the offset of each mismatch, then check that the read sequence really
    # carries a C at that offset before keeping the read.
    if nuclRef in md:
        print md, read.seq