Cloud Security

Secure your Virtual Machines in the cloud by doing the following:

  1. Reduce your open ports to as few as possible.
  2. Do NOT allow any process serving as a TCP/IP listener to run as root.
  3. Change your SSH port from the default of 22 to a non-standard port.
    1. Change this every day if you are paranoid.
    2. Change this every hour if you are crazy paranoid.
  4. Use only Public-Key Authentication for SSH access.
    1. Change your Public Keys every day if you are paranoid.
    2. Change your Public Keys every hour if you are crazy paranoid.
    3. Use keys of at least 2048 bits whether paranoid or not.
  5. Deny root-level access for all users except via the console, and ensure the console requires physical access in a secure building.
    1. Deny root level access completely if paranoid.
  6. Encrypt your disks; this keeps anyone who steals your VM Image from being able to use it.
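
A quick way to act on item 1 is to audit what is actually listening versus what you intend; here is a minimal Python sketch, where the allow-list and the probed host are made-up placeholders:

```python
# Audit open TCP ports against an explicit allow-list.
# ALLOWED_PORTS is a placeholder assumption; substitute your own list.
import socket

ALLOWED_PORTS = {443, 2222}  # hypothetical: HTTPS plus a relocated SSH port

def is_open(host, port, timeout=0.5):
    """True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

def unexpected_ports(open_ports, allowed=ALLOWED_PORTS):
    """Ports that are open but not on the allow-list."""
    return sorted(set(open_ports) - set(allowed))

# Usage: probe a few well-known ports on a host, then diff.
# found = [p for p in (22, 80, 443, 3306) if is_open("203.0.113.10", p)]
# print(unexpected_ports(found))  # anything listed here needs closing
```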

Hackers will exploit those details you have neglected!!!

Leave too many ports open with root-level access and YOU will be hacked!!!

Make things too convenient for your own use and YOU will be hacked!!!

Remain ignorant of how hackers work and YOU will be hacked!!!

Be lazy and stupid and YOU will be hacked!!!




Salivating over the idea of using Amazon EC2 GPU Instances for Map-Reduce via CUDA 4.x

Cost is only about $2.10 per hour ($1500+ per month), but just think of how cool a Map-Reduce process might be with all those CUDA cores churning away on the data.

Amazon offers GPU Instances for something like $1540 per month. The Tesla GPUs (2x NVIDIA Tesla “Fermi” M2050, 448×2 CUDA cores per instance) can do a whole lot of Map-Reduce for the money, and one might not need very many of these connected to compressed EBS to realize the potential of some form of Analytics process, for instance.  The real question is how many customers a single Amazon GPU Instance can handle before we need to scale, and then what the real scaling cost is going forward.  Each customer’s data could be stored in a separate compressed EBS Volume, all handled by a single Amazon GPU Instance or Amazon GPU Cluster.  We might be able to greatly reduce cost while providing very good performance across the board.
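
For what it's worth, the dollar figures above are easy to sanity-check; here is a sketch, where the per-instance customer capacity is an invented number you would have to benchmark for your own workload:

```python
# Rough scaling math using the hourly rate quoted above. The
# customers_per_instance capacity is a hypothetical figure; measure
# your own workload to get a real number.
HOURLY_RATE = 2.10        # USD per hour, from the post
HOURS_PER_MONTH = 24 * 30

def monthly_cost(instances):
    """Monthly on-demand cost for a number of GPU instances."""
    return instances * HOURLY_RATE * HOURS_PER_MONTH

def instances_needed(customers, customers_per_instance):
    """Ceiling division: a partially full instance costs full price."""
    return -(-customers // customers_per_instance)

print(monthly_cost(1))             # roughly 1512, the "$1500+" figure
print(instances_needed(250, 100))  # 3
```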

The storage cost might be reduced by using a compressed file system for Linux, like fusecompress, or we might build a custom solution using FUSE.  A fair number of people seem to be using fusecompress on Ubuntu, and it might also be available for Red Hat.  This could help mitigate the Amazon EBS cost.  All that remains is checking how we might leverage Amazon Micro Instances to store the Hadoop data, combined with a real-time solution for gathering aggregations; who knows, this could reduce deployment costs by a fair margin.

Keep in mind that raw data can easily be compressed to roughly 1/30th of its original size.
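
At 30:1, the EBS math changes dramatically; a quick sketch (the per-GB price below is a placeholder assumption, not a quoted Amazon rate):

```python
# Effect of 30:1 compression on storage cost. The EBS price below is a
# placeholder assumption; substitute the real current rate.
COMPRESSION_RATIO = 30           # from the post: ~1/30th original size
EBS_PRICE_PER_GB_MONTH = 0.10    # hypothetical $/GB-month

def compressed_gb(raw_gb, ratio=COMPRESSION_RATIO):
    """Storage actually consumed after compression."""
    return raw_gb / ratio

def monthly_storage_cost(raw_gb):
    """Monthly cost of storing raw_gb of data, compressed."""
    return compressed_gb(raw_gb) * EBS_PRICE_PER_GB_MONTH

print(compressed_gb(30000))         # 1000.0 -> 30 TB raw fits in 1 TB
print(monthly_storage_cost(30000))  # about 100, vs about 3000 uncompressed
```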

I may even go out and buy a slick NVIDIA 560 Ti or the like just to get my hands on something close to 500 CUDA cores, rather than the 64 I have in my 2+ year old laptop that happens to have a dual-GPU SLI setup; not bad for development and testing.

On-Demand High-Power Map-Reduce via NVIDIA CUDA and Stackless Python might just be too much to let sit idle… I might have to begin paying Amazon for some time on their Massive GPU Instances just for fun some weekend.
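
Setting CUDA aside for a moment, the Map-Reduce shape itself can be sketched with nothing but the Python standard library; this word count uses a process pool as a CPU stand-in for all those GPU cores:

```python
# Minimal Map-Reduce sketch (word count), using a process pool as a
# CPU stand-in for the CUDA cores discussed above.
from collections import Counter
from multiprocessing import Pool

def map_chunk(text):
    """Map step: count the words in one chunk of input."""
    return Counter(text.split())

def reduce_counts(counts):
    """Reduce step: merge the per-chunk counters into one."""
    total = Counter()
    for c in counts:
        total.update(c)
    return total

def word_count(chunks):
    """Run the map step in parallel across a process pool, then reduce."""
    with Pool() as pool:
        return reduce_counts(pool.map(map_chunk, chunks))

if __name__ == "__main__":
    data = ["big data big", "data fast"]
    print(word_count(data)["big"])  # 2
```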

MySQL for BigData

If all you have to work with is MySQL but you have PetaBytes to store… you could be in trouble unless… you happen to be me…

Assumption #1

Relational databases love executing really small SQL Statements.

Assumption #2

Relational databases do NOT have to use any relational features.

Assumption #3

Networked Object-Oriented data models are very efficient when all you have to work with is a relational database as the data-management platform.

Assumption #4

BigData solutions tend to use really big heaps of key/value storage because the data can easily be spread out over a large number of nodes.

Assumption #5

Many instances of MySQL can execute the same query faster than a single instance because the distributed query can be executed in parallel.

Assumption #6

Forget everything you ever thought you knew about how to cluster MySQL because all that crap won’t help you when you have PetaBytes to store and manage efficiently.

Solution #1

Store your BigData in many instances of MySQL (think 10’s or 100’s) using a Networked Object-Oriented Data Model: key/value pairs are linked to form objects using nothing but metadata, itself stored as key/value pairs, while the data is spread out across all available MySQL nodes.  Then execute the SQL required to retrieve Collections of Objects in parallel, and MySQL can be nice and fast for BigData.
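
The fan-out in Solution #1 might look something like the following sketch, where `run_query` is an injected, hypothetical per-node query function (a real version would call a MySQL driver):

```python
# Fan the same SQL statement out to every MySQL node in parallel and
# merge the per-node results. run_query is injected so the sketch stays
# driver-agnostic; a real version would call a MySQL client library.
from concurrent.futures import ThreadPoolExecutor

def parallel_fetch(nodes, sql, run_query):
    """Execute sql on every node concurrently and merge the row lists."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = pool.map(lambda node: run_query(node, sql), nodes)
        return [row for rows in per_node for row in rows]

# Usage with a fake per-node query function standing in for a driver:
fake_data = {"node1": [("obj1", "name", "Ann")],
             "node2": [("obj2", "name", "Bob")]}
rows = parallel_fetch(["node1", "node2"],
                      "SELECT object_id, k, v FROM kv WHERE k = 'name'",
                      lambda node, sql: fake_data[node])
print(rows)  # [('obj1', 'name', 'Ann'), ('obj2', 'name', 'Bob')]
```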

Caveat #1

Do you know what is meant by “Networked Object-Oriented Data Model”?!?  Probably not, but this gives you something to figure out while looking for all those cheap computers you will use to form your MySQL Network.

Caveat #2

Do you know what is meant by “executing the same SQL Statement in Parallel”?!?  Probably not, but this gives you something to figure out while you think about the prior Caveat.

Caveat #3

Do you know that fetching data from all those MySQL Instances can be done using a single SQL Statement?!?  Probably not, but then you probably forgot to read over and understand Assumption #6 above.  Think about Collections of Objects more than Rows of Data.
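
One way to picture Caveat #3: every row is an (object_id, key, value) triple, so objects reassemble from whatever rows come back, no matter which node stored them. A minimal sketch, with the triple schema assumed from the key/value description above:

```python
# Reassemble objects from (object_id, key, value) rows: the networked
# key/value layout described above. Which MySQL node a row came from is
# irrelevant; the object_id alone groups rows back into objects.
from collections import defaultdict

def assemble_objects(rows):
    """Group (object_id, key, value) triples into a dict of objects."""
    objects = defaultdict(dict)
    for object_id, key, value in rows:
        objects[object_id][key] = value
    return dict(objects)

rows = [("obj1", "name", "Ann"), ("obj2", "name", "Bob"),
        ("obj1", "role", "admin")]
print(assemble_objects(rows))
# {'obj1': {'name': 'Ann', 'role': 'admin'}, 'obj2': {'name': 'Bob'}}
```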

Caveat #4

Keep it super-simple.  Super-Simple runs faster than the other thing.

Computers are really stupid but can be fast.

Stupid requires simple.

Simple is FAST.

BigData is FAST when the solution is parallel but stupid simple.

Caveat #5

Try to optimize each MySQL Instance by increasing its available RAM to a minimum of 4 GB, using 32-bit MySQL running in a 32-bit Linux OS, and use VMware Workstation to run each instance on a separate CPU core, with a minimum of one VMware Workstation instance per CPU core.  Unless, that is, you can find a MySQL implementation that automatically uses multiple cores; then you have to give some serious thought to how to make all those MySQL Instances execute the same SQL Statements in parallel.  Better think about this one for a while… I already know how to do this, but you might not.


HADOOP Optimization Technique #1

HADOOP is slow!

BigData should be FAST!

Single Server installations for HADOOP tend to want to use the entire multi-core CPU for one single HADOOP instance.

Assumption #1

The JVM has NOT been optimized for multiple cores for anything other than garbage collection when one uses an out-of-the-box JRE.

Assumption #2

HADOOP has NOT been optimized for multiple cores for anything other than garbage collection, per Assumption #1.

Assumption #3

Most servers HADOOP might run on probably have multiple cores, especially when Intel or AMD chips are being used, due to the need to keep Moore’s Law alive in a universe where the upper bound for CPU performance is the RAM bus speed.

Assumption #4

VMware Workstation Appliances can each be run on a separate core when the host OS is Windows Server 2008 R2.

Assumption #5

VMware Workstation Appliance Instances will be run at the HIGH priority setting (one level below Realtime for Windows Server 2008 R2).

Assumption #6

VMware Workstation Appliance Instances will be given 4 GB of RAM, using 32-bit HADOOP in a 32-bit Linux OS; all software being used is 32-bit.  No 64-bit code will be used.

Possible Solution #1

If the server has 4 cores, then run 4 instances of HADOOP, each in a separate VMware Appliance, with each VMware Workstation instance dedicated to one of the available cores.

Scale for the number of cores.

Continue packing in separate VMware instances using VMware Workstation until the aggregate performance begins to degrade, then use empirical performance data to determine the optimal configuration.
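
That empirical step can be framed as picking the best-measured configuration; a trivial sketch, where the throughput numbers are invented sample data:

```python
# Pick the instance count with the best measured aggregate throughput.
# The measurements dict is invented sample data; in practice you would
# benchmark each configuration on the real host.
def best_instance_count(throughput_by_count):
    """Return the instance count whose measured throughput is highest."""
    return max(throughput_by_count, key=throughput_by_count.get)

# Hypothetical jobs/hour per VM count on a 4-core host; past 4 VMs the
# cores are oversubscribed and aggregate throughput falls off.
measurements = {1: 100, 2: 195, 3: 280, 4: 350, 5: 330, 6: 300}
print(best_instance_count(measurements))  # 4
```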

Caveat #1

Solution #1 has not yet been tried; however, based on the available information, it should produce better performance for HADOOP and/or Java in general.

