vSAN Cluster Shutdown

A few weeks ago I had to shut down a vSAN cluster temporarily for a planned, site-wide, 24-hour power outage that was blacking out a datacentre. With plenty of warning and a multi-datacentre design this wasn’t an issue, but I made use of vSphere tags and some PowerShell/PowerCLI to help with the evacuation and repopulation of the affected cluster. Hopefully some of this may be useful to others.

The infrastructure has two vSAN Clusters – Cluster-Alpha and Cluster-Beta. Cluster-Beta was the one being affected by the power outage, and there was sufficient space on Cluster-Alpha to absorb migrated workloads. Whilst they exist in different datacentres both clusters are on the same LAN and under the same vCenter.

I divided the VMs on Cluster-Beta into three categories:

  1. Powered-off VMs and templates. These were to stay in place; they would be inaccessible for the outage, but I determined this wouldn’t present any issues.
  2. VMs which needed to migrate and stay on. These were tagged with the vSphere tag “July2019Migrate”.
  3. VMs which needed to be powered off but not migrated, for example test/dev boxes which were not required for the duration. These were tagged with “July2019NOMigrate”.

The tagging was important, not only to make sure I knew what was migrating and what was staying, but also what we needed to move back or power on once the electrical work had completed. PowerCLI was used to check that all powered-on VMs in Cluster-Beta were tagged one way or another.

The following gets the VMs in Cluster-Beta that are powered on and have neither the “July2019Migrate” nor the “July2019NOMigrate” tag assigned:

#Check power state first so tags are only looked up for running VMs
Get-Cluster -Name "Cluster-Beta" | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" } | Where-Object {
    $TagNames = (Get-TagAssignment -Entity $_).Tag.Name
    $TagNames -notcontains "July2019Migrate" -and $TagNames -notcontains "July2019NOMigrate"
}
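The check above catches running VMs with neither tag, but a VM that has accidentally been given both tags would slip through. As a minimal sketch (the function name Get-OutageCategory and the category labels are made up for illustration and are not part of PowerCLI), the decision can be pulled out into a small function that works purely on a power state and a list of tag names:

```powershell
# Hypothetical helper: classify a VM for the outage from its power state and tag names.
# The tag names match those used in this post; the category labels are invented here.
function Get-OutageCategory {
    param(
        [Parameter(Mandatory)][string]$PowerState,
        [string[]]$TagName = @()
    )
    $migrate   = $TagName -contains 'July2019Migrate'
    $noMigrate = $TagName -contains 'July2019NOMigrate'

    if ($migrate -and $noMigrate)    { return 'Conflict' }         # tagged both ways
    if ($migrate)                    { return 'Migrate' }
    if ($noMigrate)                  { return 'PowerOffInPlace' }
    if ($PowerState -eq 'PoweredOn') { return 'Untagged' }         # still needs a decision
    return 'LeaveInPlace'                                          # powered off, no tag
}

# Example with PowerCLI (assumes an existing Connect-VIServer session):
# Get-Cluster -Name 'Cluster-Beta' | Get-VM | ForEach-Object {
#     [pscustomobject]@{
#         Name     = $_.Name
#         Category = Get-OutageCategory -PowerState $_.PowerState -TagName (Get-TagAssignment -Entity $_).Tag.Name
#     }
# } | Group-Object Category
```

Anything landing in the Conflict or Untagged groups needs a human decision before the migration starts.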

In the week approaching the shutdown the migration was kicked off:

#Create a list of the VMs in the source cluster which are tagged to migrate
#(Get-TagAssignment returns objects with a .Tag property, so compare on the tag name)
$MyTag = Get-Tag -Name "July2019Migrate"
$MyVMs = Get-Cluster "Cluster-Beta" | Get-VM | Where-Object { (Get-TagAssignment -Entity $_).Tag.Name -contains $MyTag.Name }
#Do the migration
$TargetCluster = "Cluster-Alpha" #Target cluster
$TargetDatastore = "vSANDatastore-Alpha" #Target datastore on target cluster
$MyVMs | Move-VM -Destination (Get-Cluster -Name $TargetCluster) -Datastore (Get-Datastore -Name $TargetDatastore) -DiskStorageFormat Thin -VMotionPriority High

At shutdown time, a quick final check of the remaining powered on VMs was done and then all remaining VMs in Cluster-Beta were shut down. Once there were no running workloads on Beta it was time to shut down the vSAN cluster. This part I didn’t automate as I’m not planning on doing it a lot, and there’s comprehensive documentation in the VMware Docs site. The process is basically one of putting all the hosts into maintenance mode and then once the whole cluster is done, powering them off.

You are in a dark, quiet datacentre. There are many servers, all alike. There may be Grues here.

When power was restored, the process was largely reversed. I powered on the switches providing the network interconnect between the nodes, then powered on the vSAN hosts and waited for them to come up. Once all the hosts were visible to vCenter, it was just a case of selecting them all and choosing “Exit Maintenance Mode”.


There was a momentary flash of alerts as nodes came up and wondered where their friends were, but in under a minute the cluster was passing the vSAN Health Check.


At this point everything was ready to power on the VMs that had been shut down and left on the cluster, and vMotion the migrated virtual machines back across. Again, PowerCLI simplified this process:

#Create a list of the VMs which were tagged to stay but need powering on
#(these never left Cluster-Beta, so that is where to look for them)
$MyTag = Get-Tag -Name "July2019NOMigrate"
$MyVMs = Get-Cluster "Cluster-Beta" | Get-VM | Where-Object { (Get-TagAssignment -Entity $_).Tag.Name -contains $MyTag.Name }
#Power on those VMs
$MyVMs | Start-VM

#Create a list of the VMs in the source cluster which are tagged to migrate (back)
$MyTag = Get-Tag -Name "July2019Migrate"
$MyVMs = Get-Cluster "Cluster-Alpha" | Get-VM | Where-Object { (Get-TagAssignment -Entity $_).Tag.Name -contains $MyTag.Name }
#Do the migration
$TargetCluster = "Cluster-Beta" #New target cluster
$TargetDatastore = "vSANDatastore-Beta" #Target datastore on target cluster
$MyVMs | Move-VM -Destination (Get-Cluster -Name $TargetCluster) -Datastore (Get-Datastore -Name $TargetDatastore) -DiskStorageFormat Thin -VMotionPriority High

Then it was just a case of waiting for the data to flow across the network and finally check that everything had migrated successfully and normality had been restored.
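That final check can be scripted too. A minimal sketch, assuming each VM has been summarised as an object with Name, Cluster and TagName properties (a shape invented for this example, not something PowerCLI returns directly) and using a hypothetical function name:

```powershell
# Hypothetical helper: list VMs tagged for migration that are not yet on the expected cluster.
function Find-StragglerVM {
    param(
        [Parameter(Mandatory)][AllowEmptyCollection()][object[]]$VM,
        [string]$Tag = 'July2019Migrate',
        [string]$ExpectedCluster = 'Cluster-Beta'
    )
    $VM | Where-Object { $_.TagName -contains $Tag -and $_.Cluster -ne $ExpectedCluster }
}

# Building the input with PowerCLI (assumes a connected session):
# $summary = Get-VM | ForEach-Object {
#     [pscustomobject]@{
#         Name    = $_.Name
#         Cluster = (Get-Cluster -VM $_).Name
#         TagName = @((Get-TagAssignment -Entity $_).Tag.Name)
#     }
# }
# Find-StragglerVM -VM $summary    # no output means everything is back home
```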

we have normality, I repeat we have normality…Anything you still can’t cope with is therefore your own problem. Please relax.

Trillian, via the keyboard of Douglas Adams. The Hitchhiker’s Guide to the Galaxy

Microsoft Azure Fundamentals

Earlier this week I took and passed the Microsoft AZ-900 exam, the requirement for the Microsoft Azure Fundamentals badge. Whilst this is the entry-level cert and not a requirement for the more advanced ones in the pathways, it is still useful for experienced techies moving into the Azure space, perhaps from other cloud platforms or on-premises architectures.

I had some prior experience dabbling in Azure, so wasn’t coming into this green. This was coupled with my general experience in server and cloud technologies so the generic concepts weren’t new to me. But I personally found working to the certification a useful way of ensuring I have a good grounding in the platform and specific terminologies before moving onto other things.


Learning Materials

There’s plenty of material out there, including a new book, but I studied by going through the free online “Azure fundamentals” learning path from Microsoft: https://docs.microsoft.com/en-us/learn/paths/azure-fundamentals/

This is a series of articles, short videos, and mini-tests with a couple of practical exercises in the Azure Portal thrown in. It covers everything from the basics (what is cloud computing etc.) through to some of the specific Azure products which are in the exam syllabus.

I coupled this learning pathway with some further exploration and experimentation in the Azure Portal and associated documentation.

The Exam

The exam itself, once all the pre-ramble, surveys, and commenting sections are removed, is 60 minutes and my test had (IIRC) 44 questions. The first five or six questions were in a separate section where I couldn’t revisit them once I’d answered and moved on, but for the remainder I was able to go back and forth and review as necessary.

Question Styles

No, I’m not going to reveal what questions I was asked, but knowing the way the questions can be asked in advance is helpful. It’s been some years since I took a Microsoft exam and question styles change.

My particular test (and remember, they’re all different) had a mixture of multiple choice (sometimes just one of four answers, sometimes more than one may have been required) and drag-and-drop answers. Within the multiple choice there were also a number of questions where I was given a statement and had to replace (if necessary) the words given in bold. For example (and this is obviously not a real AZ-900 question!)

The Microsoft Solitaire game was first released with Windows 95 to help introduce the graphical user interface.

Review the text in bold. If it makes the statement correct select “No change needed”, otherwise select the answer which makes the statement correct.

  • A-No change needed
  • B-Windows 3.0
  • C-Windows Bob
  • D-OS X

Correct Answer- B

Check out the “Exam formats and question types” videos from Microsoft for more detail.

Subjects Covered

The subjects are fully covered in the “Skills Measured” section of the exam webpage and I felt there was a good match between these lists and the questions I was posed on the day.

Going into this with a firm background in the generic cloud concepts the trickiest part for me was matching up which Azure product does what and remembering the names. I’d recommend making sure you’ve remembered as many as possible of these from the core offerings- and also be prepared to spot fake product names in the multiple choice (I’m pretty sure I saw a few of these). For example (and again, not a real AZ-900 question!)

In Microsoft Windows 10, which application could you use to assign local administrator rights to an Active Directory user?

  • A-Active Directory Users and Computers
  • B-Local Users and Groups
  • C-Windows Administrator Control
  • D-Microsoft Rights Manager

Correct Answer- B. As far as I know C and D don’t exist.

Conclusion

I’m obviously happy with my pass, and I’d recommend looking at this to anyone starting out on an Azure journey, possibly from scratch or by transitioning from other technologies. The exam isn’t compulsory, but it does validate your learning- either to yourself or your employer.

Hyper-Converged Cynicism

Or “How I’ve come to love my vSAN Ready Nodes”

I’ll admit it, some years ago I was very cynical about HyperConverged Infrastructure (HCI). Outside of VDI workloads I couldn’t see how it would fit in my environment – and this was all down to the scaling model.

With the building-block architecture of HCI, storage, compute, and memory are all expanded in a linear fashion. Adding an extra host to the cluster to expand the storage capacity also increases the available memory and CPU in the pool of resources. But my workloads were varied: one day we might get a new storage-intensive application, the next week it might be one which is memory intensive. I was used to independently expanding the storage through a SAN and just the compute/memory side through the servers, and didn’t want to be either running up against a capacity wall or purchasing unnecessary compute just to cater for storage demands.

This opinion changed when my own HCI journey started in 2017 with the purchase of a VMware vSAN cluster built on Dell Ready Nodes. Whilst I’ll be writing about that particular technology here, the principles apply to other HCI infrastructures.



If the problem with HCI is scaling, the solution is scale. These imbalances in load and growth balance out once a number of VMs are on the system, and this scale doesn’t have to be massive. Even from the 4-host starting point of a vSAN cluster I found that when the time came to install node 5, the demands on storage and memory were roughly matched to the relevant capacities of the new node.

The original hosts need to be sized correctly, but unless you’re starting in a totally greenfield environment then you will have existing hosts and storage to interrogate and establish a baseline on current usage requirements. Use these figures, allow appropriate headroom for growth, and then add a bit more (particularly when considering the storage) to prevent the new infrastructure from running near capacity. Remember you are trading a certain level of efficiency for resilience – the cluster needs to be able to withstand at least one host loss and still have plenty of capacity for manoeuvre.
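As a rough worked example of that sizing arithmetic, here is a sketch in PowerShell. The function name, the default allowances and the figures are all hypothetical, and it deliberately ignores vSAN-specific overheads such as the storage policy’s failures-to-tolerate setting, so treat it as back-of-an-envelope only:

```powershell
# Hypothetical sizing helper: how many identical hosts are needed for a given baseline?
# Works for any single resource (TB of storage, GB of RAM) as long as the units match.
function Get-RequiredHostCount {
    param(
        [Parameter(Mandatory)][double]$Baseline,         # current measured usage
        [Parameter(Mandatory)][double]$PerHostCapacity,  # what one new host provides
        [double]$GrowthFactor   = 1.3,   # assumed 30% allowance for growth
        [double]$MaxUtilisation = 0.7,   # assumed ceiling so the cluster never runs near full
        [int]$HostFailures      = 1      # spare hosts so one can fail or be patched
    )
    $needed = ($Baseline * $GrowthFactor) / $MaxUtilisation
    [int][math]::Ceiling($needed / $PerHostCapacity) + $HostFailures
}

# 40 TB in use today, 20 TB per host:
# 40 * 1.3 / 0.7 is about 74.3 TB, which needs 4 hosts, plus 1 spare
Get-RequiredHostCount -Baseline 40 -PerHostCapacity 20    # 5
```

Run it once per resource (storage, memory, CPU) and take the largest answer as the host count.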

If you are going down the vSAN route, I can thoroughly recommend the ReadyNode option. Knowing that hardware will arrive and just work with the software-defined storage layer without spending hours digging in the Hardware Compatibility Lists was a great time saver, and we’re confident that we can turn round to our vendors and say “this didn’t work” without getting told “it’s because you’ve got disk controller chipset X and that’s not compatible with driver Y on version Z”. There’s a reason I named this blog “IT Should Just Work”.

When expanding the cluster, I consider it best practice to add hosts with a configuration as similar as possible to the originals. If larger nodes are added (for example, because storage/memory/CPU is now cheaper/bigger/faster) then these can create a performance imbalance in the cluster. For example, a process running on host A might get access to a 2.2GHz CPU, but run the same process on host B with a 3GHz CPU and it will finish faster. Also worth considering is what happens when a host fails, or is taken into maintenance mode for patching. If this host is larger than its compatriots then (without very careful planning and capacity management) there might not be sufficient capacity on the remaining hosts to keep the workloads running smoothly.
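The host-failure case comes down to simple arithmetic. A sketch with made-up memory figures (the function name is hypothetical):

```powershell
# Hypothetical check: can the cluster carry the current load if its largest host is lost?
function Test-SurvivesLargestHostLoss {
    param(
        [Parameter(Mandatory)][double[]]$HostCapacity,  # capacity of each host, same units as Load
        [Parameter(Mandatory)][double]$Load             # total demand across the cluster
    )
    $total   = ($HostCapacity | Measure-Object -Sum).Sum
    $largest = ($HostCapacity | Measure-Object -Maximum).Maximum
    ($total - $largest) -ge $Load
}

# Four matched 512 GB hosts carrying 1400 GB of VM memory: losing one leaves 1536 GB
Test-SurvivesLargestHostLoss -HostCapacity 512, 512, 512, 512 -Load 1400     # True

# Swap one for a 1024 GB host and let the load grow to 1800 GB:
# the total is now 2560 GB, but losing the big host still leaves only 1536 GB
Test-SurvivesLargestHostLoss -HostCapacity 512, 512, 512, 1024 -Load 1800    # False
```

The larger host makes the headline capacity look healthier than it is, because the safe load is still set by what remains once that host is gone.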

It is possible in vSAN to add “storage-only” nodes, reducing the memory and possibly going single-socket (this saves on your license cost too!) and then using DRS rules to keep VMs off the host. Likewise “compute-only” nodes are possible, where the host doesn’t contribute any storage to the cluster. Whilst there are probably specific use-cases for both these types of nodes, the vast majority of the time I believe them to be best avoided. Without very careful consideration of workloads and operational practices these could easily land you in hot water.

So, I’m a convert. Two years down the line here and HCI is the on-premises infrastructure I’d recommend to anyone who asks. And those clouds gathering on the horizon? Well, if you migrate to VMware Cloud on AWS then you’re going to be running vSAN HCI there too!

Rubrik Build Workshop

Last week (end of May 2019) I was lucky enough to secure a place at the Rubrik Build Workshop in London. This event, which has been touring round the world, is a day of technical learning focussed on API, SDK, and version control.

The first thing to acknowledge here is that even though Rubrik was hosting the event and the presenters (the awesome pairing of Chris Wahl and Rebecca Fitzhugh) work for the company, there was absolutely no sales push. Whilst they used their own APIs and SDKs as examples, the majority of the content was very much platform agnostic. Kudos is due here for running this kind of free-of-charge educational event for the tech community without filling it with sales and marketing slides.

The morning started with a session on version control, looking at how git, and in particular GitHub, can be used to track and share code. The “RoxieAtRubrik” GitHub account was used in some hands-on demos: we all forked a public project, made changes, and submitted a pull request. The course material used in the workshops is publicly available via this account: https://github.com/RoxieAtRubrik

There were some insights into how GitHub is used at Rubrik: there are unit tests for every single function, and in the background they have a CI (Continuous Integration) pipeline at work to make sure releases are up to scratch. Quality control can be tricky on community-fed projects where developers may not be subject to traditional corporate control, and it’s interesting to see how different teams handle this input.

Our dive into version control was followed by a look at how REST APIs work, using the Rubrik APIs as an example. There was plenty of hands-on activity here, with an online lab provided to simulate communicating with a real world device but in a safe environment.
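For anyone who hasn’t made a REST call from PowerShell before, the general shape looks something like the sketch below. The helper name is made up, the hostname and credentials are placeholders, and both the use of HTTP Basic authentication and the /api/v1/cluster/me path are assumptions for illustration, so check the API documentation for your own device before relying on either.

```powershell
# Hypothetical helper: build the parameters for a REST GET using HTTP Basic authentication.
# Plain-text passwords are for illustration only; prefer Get-Credential in real scripts.
function New-RestRequest {
    param(
        [Parameter(Mandatory)][string]$Server,
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$User,
        [Parameter(Mandatory)][string]$Password
    )
    # Basic auth is "user:password", Base64 encoded, sent in the Authorization header
    $bytes = [System.Text.Encoding]::UTF8.GetBytes("${User}:${Password}")
    $token = [System.Convert]::ToBase64String($bytes)
    @{
        Method  = 'Get'
        Uri     = "https://$Server/$($Path.TrimStart('/'))"
        Headers = @{ Authorization = "Basic $token"; Accept = 'application/json' }
    }
}

# Placeholder server and credentials; the endpoint path is an assumption
$request = New-RestRequest -Server 'rubrik.example.com' -Path '/api/v1/cluster/me' -User 'demo' -Password 'demo'
$request.Uri    # https://rubrik.example.com/api/v1/cluster/me

# Splat the hashtable onto Invoke-RestMethod to make the call; the JSON reply
# comes back as a PowerShell object:
# $cluster = Invoke-RestMethod @request
```

Keeping the request-building separate from the call itself makes it easy to inspect exactly what will be sent before pointing it at a real device.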


The schedule of this event was flexible and after a show of hands amongst the 15 delegates we moved on to look at PowerShell, both in general terms for those new to the scripting language but also seeing how the SDK layer of the Rubrik PowerShell module made the API calls we’d looked at previously more user-friendly.

This PowerShell module is open source and available on GitHub (https://github.com/rubrikinc/rubrik-sdk-for-powershell) and, as with all these projects, contributions are welcome from the community. There was lots of encouragement from the presenters for customers/users to try these SDKs out and feed back any improvements that could be made, either by submitting a feature request or bug report, or by writing some or all of the addition yourself.

The European leg of the Rubrik Build tour has finished, but they’re off to Australia and New Zealand in June if that’s local to you. Check out https://build.rubrik.com/ for details.


Wear Comfortable Shoes

Ladies and Gentlemen of VMworld 2019.

Wear comfortable shoes.

If I could offer you only one tip for the conference, comfy shoes would be it.
The long term benefits of comfortable shoes have been proved by scientists, whereas the rest of my advice has no basis more reliable than my own meandering experience…
I will dispense this advice now.

Enjoy the knowledge and learning imparted at the breakout sessions; oh never mind; you will not understand all the knowledge and learning imparted until you watch the recordings.
But trust me, in 20 years you’ll look back at your notes from the event and recall in a way you can’t grasp now how much technology lay before you and how fabulous that UI really looked…
You can’t fit in as many parties as you imagine.

Do one thing every day that scares you.

Present a session.

Don’t ignore other people’s opinions, don’t put up with people who ignore yours.

Talk to people.

Don’t waste your time on free pens;
Sometimes there’s T-shirts,
Sometimes there’s LEGO.
The swag list is long, and in the end, it’s only what fits in your suitcase home that counts.

Drink plenty of water.

Maybe you’ll do the Hackathon, maybe you won’t, maybe you’ll watch a vBrownbag, maybe you won’t, maybe you’ll get an early night, maybe you’ll dance the funky chicken at the VMworld party.
Whatever you do, don’t worry too much when someone says on-premise.
Enjoy your time at the conference, Use it every way you can… Don’t be afraid of doing new things, or what other people think of them,
Spending time wisely is the greatest investment you’ll ever make…

Use that Early Bird pricing, you’ll miss it when it’s gone.

Be nice to your peers in the vCommunity; They are the best way to learn and the people most likely to stick with you in the future

Stretch.

Go to VMworld US once, but leave before it makes you hard;
Go to VMworld EU once, but leave before it makes you soft.

Accept certain inalienable truths, vBeards will grow and turn grey, vendors will talk FUD, you too will get tired, and when you do you’ll fantasise that when you were younger vChins were clean-shaven, vendors were noble, and the flash client was the best thing since sliced bread.

But trust me on the comfortable shoes…