Tuesday, June 30, 2015

CFEngine Resources courtesy of Vertical Sysadmin, Inc.


CFEngine Resources from Vertical Sysadmin

Purpose: There is a lot to know about CFEngine, which can make it hard for newcomers to know where to begin.  The purpose of this guide is to lay out the resources available to CFEngine students and to orient them to this body of knowledge, to speed their journey into practical system automation with CFEngine 3.
The guide is based largely on the materials at cfengine.com, with full gratitude to Mark Burgess for continuously raising the bar in the field of system administration.
Comments welcome, please email me.
Getting Started
  • CFEngine Quickstart: a Quick Start Guide.  Build and install from source; or install a package.  Then either set up CFEngine to run from cron; or start learning by running individual examples.
  • CFEngine 3 Concept Guide: an abbreviated version of the CFEngine tutorial.  Topics include: Introduction – System automation; The components of CFEngine; Bodies and bundles; A simple crash course in concepts; Knowledge Management.
Core Documentation
Learning CFEngine
  • Mark Burgess’s Introduction to CFEngine 3: four videos comprising Mark’s day class on CFEngine (find them at the top of the linked Training page):
    1. Introduction and motivation
    2. Understanding patterns and knowledge
    3. Client-Server basics
    4. Recap and the CFEngine landscape
  • CFEngine 3 Practical Examples: A collection of practical examples to help learn CFEngine 3.  Use "ls -1" to display them in order (they are arranged from basic to more advanced).
  • CFEngine 3 Cookbook: A growing collection of practical examples well explained.  Neil’s writing helped many sysadmins start with CFEngine 2 and 3.
CFEngine Policy Source Code Libraries
Special Topics
  • There is a large (and growing) number of guides on various topics: devops, file editing, adopting CFEngine, etc.  Check Special Topics for the full list.
Enterprise Edition
  • Nova Quick Start Guide: helps you install Nova and gives an overview of the beautiful admin GUI (the “Mission Portal”).
  • CFEngine Nova Technical Supplement: details Nova installation, admin GUI, and Nova capabilities:
    • Business integration
    • Monitoring extensions
    • Database control
    • File ACLs
    • Server extensions
    • Environments and workflows
    • Virtualization
    • Content-Driven Policies (Community Edition can do this too)
    • Windows-specific features
Demo Videos
Video presentations demonstrating key capabilities of the CFEngine Community and Enterprise editions.  The content below is taken directly from www.cfengine.com:

CFEngine and Change Detection
CFEngine and Change Detection – CFEngine offers extensive tripwire functionality. Combined with CFEngine auto-repair functionality you can ensure policy compliance on files and directories. In this video you will see how CFEngine detects file creation, file change and file deletion. Use CFEngine to secure your files and prevent unauthorized changes. With CFEngine you can keep track of all configuration changes and view them in detailed reports.
CFEngine and Apache Webserver
CFEngine and Apache Webserver – This demo shows you how CFEngine can manage an Apache webserver to ensure server availability and uptime. Use policies to define how to manage your webservers, and CFEngine will make sure your webservers are compliant with your policies. Avoid downtime due to misconfigured or accidentally deleted files.
CFEngine and Windows Registry
CFEngine and Windows Registry – This demo shows you how CFEngine can manage the Windows registry. With CFEngine you can make sure the registry always stays compliant with your policies. Manage the whole registry or just specific keys and / or key-values.
CFEngine and PXE Boot
CFEngine and PXE Boot – CFEngine manages your servers and networks of machines throughout the life-cycle. Use CFEngine during the build and deploy phase by creating PXE-boot servers controlled and deployed by CFEngine. Based on policies you can turn any server into whatever configuration setting you like, independent of operating system and required services. This demo will show how to create a CFEngine PXE-boot server and then install Redhat on a clean machine using CFEngine. PS: This demo has sound.
The Orion Cloud Pack
The Orion Cloud Pack – The three steps you need to bring reliability and efficiency to managed services running out of the Amazon cloud. Set up and tear down machines as you like, and bring instant configuration and compliance, with self-healing, to your business.
CFEngine Computer-Process Management
CFEngine Computer-Process Management – Use CFEngine to manage your computer processes. Use application-availability policies to ensure uptime at all times. CFEngine starts, kills and/or restarts processes, all according to your policies. In this video we will show how CFEngine restarts a broken Apache server. The demo will show how CFEngine can kill a process, and how the CFEngine agent automatically restarts the same process.
CFEngine DNS Resolver
CFEngine DNS Resolver – This clip shows how CFEngine can deal with a very common issue in network configuration: setting up the name-server bindings. DNS configuration is sometimes done by DHCP, but statically configured servers can be maintained by CFEngine directly. We show how CFEngine repairs a damaged configuration file to ensure the correct settings.
Misc.

Friday, June 26, 2015

clobber

mv --no-clobber option

Problem:

I'm writing a script in which I need to move files from one directory to another. However, some of the files have the same name, and I want to keep the older file, so I can't use --update. I'm moving files from the newer directory to the older directory, so what I'm looking for is a way to automatically not overwrite. Basically, I need the behavior of mv with the opposite of --force option. I can't use the --interactive option either, because I'm copying multiple files and I don't want mv to hang.

There's no reason I must use mv, I just assumed it'd be the easiest way to accomplish what I need. If there's an easier way that doesn't involve mv I'm open for suggestions.

After searching around a while I found this recent webpage which makes it seem as though mv will now have a --no-clobber option which will do exactly what I need. I'm running Ubuntu on this computer, so I'm sure the webpage is relevant, but mv doesn't like --no-clobber despite the fact that my system is updated.

So basically what I need is explained in the first paragraph. I want a script to move files from one directory to another and automatically NOT overwrite: the opposite of --force.

Thanks in advance for any help or suggestions!
 
Solution:
 
Code:
fromdir=/path/to/original/files
destdir=/path/to/destination/directory

# Work from the source directory; bail out if it is not accessible
cd "$fromdir" || exit 1

# Move each file only if no file with the same name already exists
# in the destination, i.e. never overwrite anything
for file in *
do
  [ -f "$destdir/$file" ] || mv "$file" "$destdir"
done

 The above code will work nicely.

The other possible solution in bash is:
 
set -o noclobber
 
which prevents output redirection (>) from overwriting existing files. Note that noclobber (also available as set -C) affects only the shell's redirection, not mv itself.
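
If your system has a sufficiently recent GNU coreutils, mv itself also understands the -n / --no-clobber option, which silently skips any file that already exists at the destination. A minimal sketch, reusing the fromdir and destdir variables from the script above:

Code:
# Move everything from fromdir, skipping names that already exist in destdir
mv -n "$fromdir"/* "$destdir"/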

How to Flush Memory Cache on Linux Server

Sometimes a Linux system that has been running for a while appears to be low on memory. The reason is that Linux uses otherwise idle RAM for the disk cache: unused RAM is wasted RAM. The cache keeps frequently used data in memory, and reading data from the cache is far faster than reading it from the hard drive.
It is good for the OS to serve data from the cache in memory; if the data is not found in the cache, it is simply read from disk. Flushing the cache is therefore harmless, though rarely necessary. This article explains how to flush the memory cache on a Linux server.
Empty Linux Buffer Cache:
There are three options available for flushing the Linux memory cache. Use one of the commands below, as per your requirements.
1. To free pagecache, dentries and inodes in cache memory
# sync; echo 3 > /proc/sys/vm/drop_caches
2. To free dentries and inodes use following command
# sync; echo 2 > /proc/sys/vm/drop_caches
3. To free pagecache only use following command
# sync; echo 1 > /proc/sys/vm/drop_caches
Setup Cron to Flush Cache Regularly
It is a good idea to schedule the following in crontab to automatically flush the cache at a regular interval.
# crontab -e
0 * * * * sync; echo 3 > /proc/sys/vm/drop_caches
The above cron entry executes every hour and flushes the cached memory on the system.
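If you prefer to flush only when the cache is actually large, a small wrapper script can check /proc/meminfo first. The following is a minimal sketch under my own assumptions (the 1024 MB threshold and the use of the Cached field are illustrative choices, not part of the original instructions); it must run as root:
#!/bin/sh
# Flush the page cache only when cached memory exceeds a threshold (in MB)
threshold_mb=1024
cached_mb=$(awk '/^Cached:/ {print int($2 / 1024)}' /proc/meminfo)

if [ "$cached_mb" -gt "$threshold_mb" ]; then
    sync                               # write dirty pages to disk first
    echo 3 > /proc/sys/vm/drop_caches  # drop pagecache, dentries and inodes
fi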
Finding Cache Memory Usage in Linux
Use the free command to find out how much memory the Linux system is using for cache. The output of free looks like the following:
# free -m
Sample Output
             total       used       free     shared    buffers     cached
Mem:           992        406        586          0        155        134
-/+ buffers/cache:        116        876
Swap:         2015          0       2015
The last column shows the memory used for cache by the system (134 MB here). The -m option displays the figures in MB.



Thursday, June 18, 2015

Big Data Tutorial 1: MapReduce

What is dumbo and how do I ssh into dumbo to run jobs?
Dumbo is the stand-alone Hadoop cluster running on the Hortonworks Data Platform. It can be used to perform various MapReduce jobs for big data analytics.

To access dumbo: I recommend using Mac OS in the classroom. The PCs provided have a Mac OS option.

Please follow the instructions at this link:
or
Mac OS users only:
Make sure to follow the instructions for Web UI access using the above link.

  • cd /Users/NetID
  • mkdir .ssh
  • cd .ssh
  • touch config
  • vi config

(For Mac users only) Copy and paste the text below into ~/.ssh/config:
Host hpctunnel
    HostName hpc.nyu.edu
    ForwardX11 yes
    LocalForward 8025 dumbo.es.its.nyu.edu:22
    DynamicForward 8118
    User NetID

Host dumbo
    HostName localhost
    Port 8025
    ForwardX11 yes
    User NetID
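
With this config in place, a typical workflow (my own usage sketch, not part of the original instructions) is to open the tunnel in one terminal and leave it running, then connect to dumbo through it from a second terminal:

# Terminal 1: log in to hpc.nyu.edu and keep the session (and the tunnel) open
ssh hpctunnel

# Terminal 2: connect to dumbo through the forwarded local port 8025
ssh dumbo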

What is Hadoop?
Hadoop is an open-source software framework for storing and processing big data in a distributed/parallel fashion on large clusters of commodity hardware. Essentially, it accomplishes two tasks: massive data storage and faster processing. The core Hadoop consists of HDFS and Hadoop's implementation of MapReduce.

What is HDFS?

HDFS stands for Hadoop Distributed File System. HDFS is a highly fault-tolerant file system and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
What is Map-Reduce?
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce job splits a large data set into independent chunks and organizes them into key-value pairs for parallel processing. A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and a value, which is either the data that is identified or a pointer to the location of that data. The mapping and reducing functions receive not just values, but (key, value) pairs. This parallel processing improves the speed and reliability of the cluster, returning results more quickly and more reliably.



Every MapReduce job consists of at least three parts:
  • The driver 
  • The Mapper 
  • The Reducer 

Mapping Phase
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one element at a time, to a function called the Mapper, which transforms each element individually into an output data element.
The input is divided into ranges (splits) by the InputFormat, and a map task is created for each split. The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into groups of key-value pairs, one group for each reducer.

 
Mapping creates a new output list by applying a function to individual elements of an input list.
Reducing Phase
Reducing lets you aggregate values together. A reducer function receives an iterator of input values from an input list. It then combines these values together, returning a single output value.
The Reduce function then collects the various results and combines them to answer the larger problem that the master node needs to solve. Each reduce pulls the relevant partition from the machines where the maps executed, then writes its output back into HDFS. Thus, the reduce is able to collect the data from all of the maps for the keys and combine them to solve the problem.

 
Reducing a list iterates over the input values to produce an aggregate value as output.
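
As a rough analogy only (not part of the tutorial), the map-shuffle-reduce flow for counting words can be mimicked with an ordinary shell pipeline: the first stage "maps" each line of a hypothetical input.txt into individual words, sort plays the role of the shuffle by grouping identical keys together, and uniq -c "reduces" each group to a count:

# map: one word per line; shuffle: sort groups identical words; reduce: count each group
tr -s '[:space:]' '\n' < input.txt | sort | uniq -c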

 MapReduce Data Flow




What are the components of the dumbo cluster at NYU and what can they be used for?
Let's look at the UIs for a better understanding:
Commands for HDFS & MapReduce:

Transferring data from your workstation to dumbo:

my_workstation$ scp my_file dumbo:
my_workstation$ ssh dumbo


HDFS COMMANDS

TO UPLOAD DATA TO HDFS

hadoop fs -put <filename_in_lfs> <hdfs_name>
    or
hadoop fs -copyFromLocal <filename_in_lfs> <hdfs_name>
    or
hdfs dfs -put <filename_in_lfs> <hdfs_name>


TO GET DATA FROM HDFS

hadoop fs -get <hdfs_name> <filename_in_lfs>
    or
hadoop fs -copyToLocal <hdfs_name> <filename_in_lfs>


TO CHECK HDFS FOR YOUR FILE
hadoop fs -ls


MAPREDUCE COMMANDS

TO COMPILE JAVA FILES

javac -cp $(yarn classpath) my_code.java
or
javac -classpath /share/apps/examples/Tutorial1/hadoop-core-1.2.1.jar   *.java


TO MAKE THE JAR FILE
jar cvf <jarfilename>.jar *.class

TO TRIGGER THE JOB
hadoop jar <jarfilename>.jar <DriverClassName> <ip_file_in_HDFS> <op_dir_name>

TO CHECK RUNNING JOB
hadoop job -list

TO KILL THE JOB
hadoop job -kill <job_id>




Example MapReduce jobs:

  1. Word Count: The objective here is to count the number of occurrences of each word by using key-value pairs.
Step 1:
ssh into dumbo

Step 2:  
Move to
cd /share/apps/examples/Tutorial1/example1
It includes the following files:
Example.txt ------ input file
SumReducer.java ------ the reducer
WordMapper.java ------ the mapper
WordCount.java ------ the driver
WordCount.jar ------ compiled jar file used to run the MapReduce job

Step 3:
Copy the example1 folder to your home directory (/home/netid/example1)
cp -r /share/apps/examples/Tutorial1/example1 /home/netid/

Step 4:
Place the example.txt file onto HDFS
hadoop fs -put example.txt example.txt
Step 5:
Run the MapReduce job using WordCount.jar
hadoop jar WordCount.jar wordcount example.txt wordcountoutput
Step 6:
Check the output by accessing the HDFS directories
hadoop fs -get wordcountoutput
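hadoop fs -get copies the whole output directory back to the local file system. To simply peek at the result while it is still in HDFS, you can also cat the part files written by the reducers (assuming the default part-r-00000 style naming used by MapReduce jobs):
hadoop fs -cat wordcountoutput/part-r-00000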


      2. Standard Deviation: The objective is to find the standard deviation of the lengths of the words.
Step 1: 

Move to 
cd /share/apps/examples/Tutorial1/example2
example2.txt - Input file
StandardDeviation.jar - compiled jar file

Step 2: 
Copy the example2 folder to your home directory (/home/netid/example2)
cp -r /share/apps/examples/Tutorial1/example2 /home/netid/

Step 3:
Place the example2.txt file onto HDFS
hadoop fs -put example2.txt example2.txt

Step 4:
Run the mapreduce job using StandardDeviation.jar
hadoop jar StandardDeviation.jar wordstandarddeviation example2.txt standarddeviationoutput

Step 5:
Check output by accessing HDFS directories
hadoop fs -get standarddeviationoutput

     3. Sudoku Solver: The objective is to solve the given sudoku puzzle using MapReduce.
Step 1:  
Move to
cd /share/apps/examples/Tutorial1/example3
Sudoku.dft - Puzzle
sudoku.jar - Compiled jar file

Step 2: 
Copy the example3 folder to your home directory (/home/netid/example3)
cp -r /share/apps/examples/Tutorial1/example3 /home/netid/

Step 3:
Run the mapreduce job 
hadoop jar sudoku.jar sudoku.dft

(Note: Twitter sentiment analysis can be done using this cluster. It requires the use of Java for MapReduce and a Pig script for sorting the Twitter users based on the number of tweets. The next steps would be setting up an Oozie workflow and observing the analysis in Hue. To learn more about sentiment analysis, please contact hpc@nyu.edu)

MapReduce Streaming

Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages such as Python, shell scripts or C++. Hadoop Streaming is a utility that comes with the Hadoop distribution. This utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
Streaming runs a MapReduce job from the command line. You specify a map script, a reduce script, an input and an output. Streaming takes care of the MapReduce details, such as making sure that your job is split into separate tasks and that the map tasks are executed where the data is stored. Note that Hadoop Streaming works a little differently from the Java API: your program is not presented with one record at a time; you read records from standard input and iterate over them yourself.
          • -input – the data in HDFS that you want to process
          • -output – the directory in HDFS where you want to store the output
          • -mapper – the script, command line or executable that you want to use for your mapper
          • -reducer – the script, command line or executable that you want to use for your reducer
            The streaming jar is located at /share/apps/examples/Tutorial1/hadoop-streaming-2.6.0.2.2.0.0-2041.jar on dumbo.

            Command used to run a mapreduce job using streaming:
            hadoop jar /share/apps/examples/Tutorial1/hadoop-streaming-2.6.0.2.2.0.0-2041.jar -input streamingexample.txt -output streamout1 -mapper mapper.py -reducer reducer.py -numReduceTasks 2
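
            The mapper.py and reducer.py referred to above are not shown in this tutorial. Since streaming accepts any executable, here is a hypothetical minimal word-count pair written as shell scripts (a sketch of the idea, not the course's actual code). Mark them executable with chmod +x and ship them with the job, e.g. via -mapper mapper.sh -reducer reducer.sh -file mapper.sh -file reducer.sh:

            #!/bin/sh
            # mapper.sh: read text from stdin, emit one "word<TAB>1" pair per word
            tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'

            #!/bin/sh
            # reducer.sh: input arrives sorted by key; sum the counts for each word
            awk -F '\t' '{ count[$1] += $2 } END { for (w in count) print w "\t" count[w] }'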

            (Note: R can also be used to run mapreduce jobs. Please contact hpc@nyu.edu to learn more)