Tuesday, December 29, 2015

Monitor CentOS 7 Client with OMD / Check_MK

Quick instructions for adding a CentOS 7 client to OMD / Check_MK.

Log into the CentOS 7 client via ssh/terminal. Install check-mk-agent

# yum install check-mk-agent



Once installed, edit xinetd config:

# nano /etc/xinetd.d/check-mk-agent

Uncomment the only_from line and add the IP address of your OMD/Check_MK server.
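In the stock agent config that line is commented out; after editing it should look something like this (192.0.2.10 is just a placeholder for your monitoring server's IP):

only_from      = 127.0.0.1 192.0.2.10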



Save and Close

Start the xinetd service and enable it at boot:

# systemctl start xinetd.service
# systemctl enable xinetd.service

Add port 6556 to firewall:

# firewall-cmd --add-port=6556/tcp --permanent
# firewall-cmd --reload
# firewall-cmd --list-all
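Before adding the host, you can confirm the agent answers from the OMD/Check_MK server (a quick sanity check; nc must be installed, and 192.0.2.20 here stands in for the client's IP):

# nc 192.0.2.20 6556 | head -3
<<<check_mk>>>
Version: 1.2.6p12
AgentOS: linux

If you see the agent header like this, the xinetd and firewall configuration is working.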

Now you can add the client to your Check_MK panel via New Host and it will automatically add the checks.

Wednesday, November 25, 2015

Formatted with Type 2 Protection, huh?


Bought some Seagate SCSI disks (ST9600104SS) with a Synology expansion unit (RX1213sas) to expand the storage array of a Synology RackStation device.  If you recognize the title then you know why I am posting.

I added this new batch of disks to the Synology expansion unit and connected the expansion via external mini-SAS cabling to the host Synology.  In the DiskStation administration panel the disks showed up just fine, great.  But when I attempted to expand the RAID group: nothing, no GUI message, no error message.  After about two weeks of troubleshooting (expansion unit cabling, etc.), I checked /var/log/messages and got my first real clue:

sfdisk: exception.c:159 Error: Input/output error during write on /dev/sas15

An I/O error.  I then tried to partition the disk using fdisk (gparted is not available) and hit the same issue: I could not write partition information to the disk.  At this point I was not sure whether the issue was expansion unit related or disk related.  Over a two-week period the following took place to help troubleshoot:
  • The cold spare disks from the original batch work just fine in the expansion unit, so the expansion unit and cabling are good.
  • The new batch of disks works just fine in a Dell server with a RAID controller; I was able to write a partition via the RAID controller, and I even created a volume and installed an OS.
  • Once I partitioned/wiped the disks via the Dell server I tried them in the Synology expansion unit again; same I/O error as before.
I got a breakthrough by using smartctl (thankfully available on the Synology) to get SMART information from the new batch of disks and compare it with the old batch:

Original disk [serial blanked out]:

NAS> smartctl -i /dev/sas1
smartctl 6.2 (build date Oct 28 2015) [x86_64-linux-3.10.35] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST9600104SS
Revision:             FMF2
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c5003c3e2d97
Serial number:        
Device type:          disk
Transport protocol:   SAS
Local Time is:        Tue Nov 24 07:40:08 2015 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

New disk:

NAS> smartctl -i /dev/sas15
smartctl 6.2 (build date Oct 28 2015) [x86_64-linux-3.10.35] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST9600104SS
Revision:             MS05
User Capacity:        600,127,266,816 bytes [600 GB]
Logical block size:   512 bytes
Formatted with type 2 protection
Rotation Rate:        10000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c5002891f82b
Serial number:        
Device type:          disk
Transport protocol:   SAS
Local Time is:        Mon Nov 23 11:36:05 2015 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

Notice the differences?  For one, the firmware revisions are different; this is because the original batch of disks are Dell (OEM) branded disks, while the new batch are Seagate (retail) branded disks.  But it was the second difference that caused me some confusion:

Formatted with type 2 protection

Not knowing what this was, I went down a seemingly never-ending spiral of T10 Protection Information [PDF] standards.  It's pretty neat: as I understand it, the disk controller formats the platters to 520-byte sectors instead of the more traditional 512-byte sectors, and these 8 extra bytes per sector let the controller verify that the data written to a sector is the same data that is read back from it, a sort of built-in data verification.  The disk then presents the system (HBA controller or RAID card) with the normal 512 bytes of data per sector, and any SCSI-compatible controller should be able to read and write to it just fine.
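As I understand the layout from the T10 document (a rough sketch, so double-check the spec), each on-disk sector ends up looking like this:

| 512 bytes of user data | 2-byte guard tag (CRC) | 2-byte application tag | 4-byte reference tag |
|<------------------------------- 520 bytes on the platter ------------------------------------>|

The guard tag is a CRC of the data portion and the reference tag normally relates to the LBA, which is how the controller can catch corrupted or misplaced writes while the host still sees plain 512-byte logical blocks.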

It was here that I focused on this difference, and not on attempting to match the firmware on the disks.

The first problem I needed to tackle was finding a way to use some better tools on the disks; the built-in Synology utilities are pretty bare-bones.  Since these are SAS disks I can't pop them into a desktop to work on them, but I luckily had an old server that uses an HBA controller (for software RAID) rather than a RAID card, which made it easier to query the disks directly.  I ran smartctl on this server to get the SMART info from a new disk, and it DID NOT explicitly say it was a Type 2 disk like above, which threw me for a loop, but that just affirmed I needed better tools.

In Seagate's T10 Protection Information document [above] there is a paragraph on how to set and determine the PI Type using FMTPINFO.  After looking into how I could query this, I found that Linux has a suite of utilities designed just for this purpose: sg3_utils.  The CentOS live USB I was using sadly did not have this package installed, but it was in the repos:

yum install sg3_utils

Using the amazing example area of this Ubuntu man page on sg_format, I once again queried an original disk and a new disk for the Type information:

Original disk:

[root@livecd ~]# sg_readcap -l /dev/sda
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Thin provisioning: tpe=0, tprz=0
   ....

New disk:

[root@livecd ~]# sg_readcap -l /dev/sda
Read Capacity results:
   Protection: prot_en=1, p_type=1, p_i_exponent=0
   Thin provisioning: tpe=0, tprz=0
   ....

Notice the prot_en and p_type bits; now I knew without a doubt that the first batch of disks and the second batch of disks I purchased are two completely different formats.  Still unknown to me at this point was WHY the NAS controller would not read and write to these disks, but I figured that if I could low-level format the disks with NO protection information, then I might get lucky.  Thankfully the Ubuntu man page above has excellent examples, and I was easily able to format them with sg_format.  Please note that a format completely erases the disks!

[root@livecd ~]# sg_format --format --fmtpinfo=0 /dev/sda
SEAGATE   ST9600104SS   MS05   peripheral_type: disk [0x0]
  << supports protection information>>
Mode Sense (block descriptor) data, prior to changes:
  Number of blocks=1172123568 [0x45dd2fb0]
  Block size=512 [0x200]

A FORMAT will commence in 10 seconds
ALL data on /dev/sda will be DESTROYED
    Press control-C to abort
A FORMAT will commence in 5 seconds
ALL data on /dev/sda will be DESTROYED
    Press control-C to abort

Format has started
Format in progress, 0% done
....
Format in progress, 99% done
FORMAT Complete

The format took about 8 hours.  Let's do another check of the protection type:

[root@livecd ~]# sg_readcap -l /dev/sda
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0

   Thin provisioning: tpe=0, tprz=0

Woohoo!  With the disks formatted with no protection information I threw them back in the expansion unit to try them once again, and what do you know, DiskStation was able to add the disks to the RAID Group just fine:


After the initial failures with this batch of disks I could have easily scoffed and returned the disks or expansion unit for a refund (and I would have gotten one), but instead a simple curiosity about what Protection Information was led me to the solution, and to knowledge I can bring to similar problems in the future.

Wednesday, October 21, 2015

Update OMD (Open Monitoring Distribution) from 1.20 to 1.30

Open Monitoring Distribution was recently updated to 1.30; here is how to update.  My instance of OMD was installed from the consol.de repo, which I recommend because it is an easy way to install and update OMD.  I am using the stable release, with CentOS 7 as the Linux distro.  The bad part is that check_mk updates faster than OMD! ;)
The update boils down to a few steps; a sketch of the corresponding commands is below.
  1. Check for the update.
  2. Install the 1.30 bits.
  3. Log into the monitoring site so it can be updated.
  4. Check the current version.
  5. Run omd update.
  6. Start the site.
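On CentOS 7 with the consol repo the sequence looks roughly like this ('mysite' is a placeholder for your site name, and the omd-1.30 package name follows the repo's versioned naming, so adjust as needed):

# yum check-update
# yum install omd-1.30
# su - mysite
OMD[mysite]:~$ omd version
OMD[mysite]:~$ omd stop
OMD[mysite]:~$ omd update
OMD[mysite]:~$ omd start

omd update will prompt before switching the site over to the newly installed version.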
With the bits updated and the site started, the GUI should now be working.  Now it's time to update all of my clients from check_mk 1.2.4p5 to 1.2.6p12; the new agents are found in the share/check_mk/agents directory of your site folder.

Wednesday, September 23, 2015

Windows 10: Fix your broken start menu + apps on a domain, Event ID 1000


Windows 10 was designed around a new user experience that is supposed to close the gap between tablet/mobile and desktop interfaces.  But what if these new graphical elements (the old name was Metro apps) will not open?  There are a LOT of cases out there about the Windows 10 Start menu being broken, and plenty of fixes, but no matter what I tried the Start menu would not work.

The domain we are on was upgraded from SBS 2011 and has a lot of group and security policies, and Windows 10 would work fine until it was added to the domain.  Once on the domain, any attempt to open the Start menu, Edge, the taskbar right-click menu, or any other new UI element caused the app to close immediately or not open at all, and a new entry would show up in Event Viewer:

Event ID: 1000 Application Error
Faulting application name: ShellExperienceHost.exe, version: 10.0.10240.16425, time stamp:
Faulting module name: Windows.UI.Xaml.dll, version: 10.0.10240.16431, time stamp:
Faulting application path: C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
Faulting module path: C:\Windows\System32\Windows.UI.Xaml.dll

You get the idea, not good.

I finally found a fix by using procmon to trace ShellExperienceHost.exe and seeing that there was an exit right at the point of the Fonts folder.  Ah hah.  After checking the permissions of the Fonts folder, sure enough, there is no ALL APPLICATION PACKAGES security group listed.  After adding it, the new UI elements and Start menu started working immediately.  I rebooted just to see if the new permissions survive a reboot, and they did.

Here's how to add this group to the fonts folder:

Open cmd as administrator and run the following; it clears the read-only and system attributes from the folder, which allows you to edit it.

 attrib -r -s c:\windows\fonts  

You will see the Fonts folder now looks a little different.
You can now right click on it, then select Properties.  On the Security tab click Edit, then Add...

Select Locations and choose the local computer, then add the ALL APPLICATION PACKAGES group and click Check Names:


After clicking OK, I gave the group Full control of the folder.  I am not sure what specifically it needs to be set to, but this worked for me.
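If you'd rather skip the GUI, the same permission can likely be granted from an elevated command prompt with icacls (a sketch; as noted above, Full control may be more than strictly necessary):

 icacls c:\windows\fonts /grant "ALL APPLICATION PACKAGES":(OI)(CI)F  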

You can change the Fonts folder back to default read-only by doing

 attrib +r +s c:\windows\fonts  

Apparently the new apps under this identity were not able to access the fonts required to render themselves.  If you are having these issues, keep in mind that the ALL APPLICATION PACKAGES group is required on many folders in Windows 10 and it might not have the access it needs.

Thursday, April 23, 2015

Server Core: Uninstalling an application without control panel

As I spend more time in Windows Server Core and Hyper-V Server I appreciate more of the inner workings of the operating system; I also appreciate the GUI more :).  Without a Control Panel, uninstalling an application is not as straightforward.

In a command prompt open up the registry editor:
 c:\>regedit.exe  

Browse to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall

Here installed programs are listed by application ID.
Browse through them and look for the Display Name to identify the application you need to uninstall.
There will be a value called UninstallString; right click it, select Modify, and copy its contents:
This value will be used in the command prompt, running msiexec.exe with the /i switch (/i installs or configures a product).
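For example, with a typical MSI-based application the command ends up looking something like this (the GUID below is made up; paste the actual value from your UninstallString):

 c:\>msiexec.exe /i {12345678-90AB-CDEF-1234-567890ABCDEF}  

If you would rather remove the product directly without the configure/maintenance dialog, msiexec also accepts /x with the same GUID.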
Depending on the application, an uninstaller will run; it may be a GUI wizard or another automated uninstaller.

Once it finishes, you should see the application ID removed from the Uninstall registry key above.

Wednesday, April 22, 2015

List of White Noise / Ambient Sounds to stay focused and increase your productivity

One thing that has helped me stay focused is light background noise and sounds.  It helps drown out possible interruptions, and people tend to not bother me with headphones on :). I keep it mixed up by rotating several different sources; here is a list.

  1. myNoise.net - Large collection of hand recorded sounds which are great quality.  Each sound has a mixer with presets that slightly changes the sound and can be 'animated'. 
  2. A Soft Murmur - Easy to adjust volumes of different sounds to create your own mixes, with a master volume.
  3. Noisli - Good selection of different sounds that can be individually controlled.  Love the railroad track.
  4. Coffitivity - Cafe and other environmental sounds.
  5. Rainy Cafe - You have Rain, and you have Cafe, simple interface so adjust the volumes how you see fit.
  6. raining.fm - Rain, thunder, and lightning that can be mixed together.  The break and sleep timers are a nice addition.
  7. ambient-mixer.com - Community driven noise and sound collections, large library.

Tuesday, April 21, 2015

TrippLite: Upgrading firmware on a SNMPWEBCARD

The TrippLite SNMPWEBCARD allows remote access and alerting for TrippLite UPSes and PDUs.  Please read the resources below first.  The card I am upgrading is a Generation 3 card that came with firmware version 012.004.052, and it will be updated to 012.006.064.  Download the latest firmware from the card's home page; it is a folder that contains multiple firmware versions, as well as the massupdate.exe program.  I unpacked it to C:\.

Make sure you read the READ ME - RELEASE NOTES, as the firmware has to be updated in a certain order; the card can be bricked if it is updated incorrectly.  The update order for my card is to .55, then to .64.  For the .55 update we will use FTP; to then go to .64 we will use the recommended massupdate.exe found in the Utility folder of the firmware package.

First connect the serial cable; this is a cable provided by TrippLite, and other cables may not work.  PuTTY into the web card using the serial connection with the following settings:


If you have an issue connecting: connect via PuTTY first, unscrew and unslot the card, plug in the serial cable between the web card and the PC/server, and slot the card back into the UPS/PDU.  This reboots the card, and the stream should then output to the PuTTY console.

Once connected to the card, log in and reboot it:

We will use the serial stream to keep an eye on what the card is doing, but we will use a command prompt with the Windows FTP program to upload the new firmware.  Open a command prompt and browse to the firmware folder you unpacked.  Use ftp "IP address" to open an FTP session with the web card; it will ask you to log in.
Change to binary mode with the 'bin' command and upload the firmware with the 'put' command, the .55 file in my case.  There is no tab completion, so type accurately.
Once it is uploaded, type 'bye'.  The card will install the firmware and restart:
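For reference, the session looks roughly like this (the IP address and firmware file name are placeholders; use your card's IP and the actual .55 file name from the firmware package):

C:\firmware>ftp 192.168.1.50
ftp> bin
ftp> put firmware_055.bin
ftp> bye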
Upon bootup it shows the .55 firmware was uploaded.
Repeat the same steps to upload the pwralert.dat file:
The card will once again reboot; this may take a while.

Before using massupdate.exe to update to .64, I had an issue with the application logging into the card to upload the new firmware, so I decided to reset the settings on the card.  This is done by rebooting the card and pressing a key on the console within the first 5 seconds; it then asks if you want to reset the settings.  This changed the IP to DHCP and reset the admin password.  If you have issues with massupdate.exe connecting, this may fix it.

Now that we are on .55 we will use massupdate.exe, found in the Utility directory (SNMPWEBCARD-FW-Gen3-12-6-064-RC1\Utility), to update the card to .64.

Open massupdate.exe.  Under Path of Files to Update click browse and select the Version directory (\SNMPWEBCARD-FW-Gen3-12-6-064-RC1\Firmware\Version 12.06.0064)
Click Add Devices at the bottom, enter the IP of the web card.  I used the default login.
Click update on the bottom and click Start on the progress window:
The image will upload and the card will reboot.  Be patient while the firmware installs; it's a major upgrade and it takes a while.
The card will reboot (again, this is a long process) and you will be on the latest firmware.

Once the card is fully loaded you can connect to the web admin panel using the DHCP IP.  Also, enjoy whitelisting all of the Java stuff; because the panel is Java, the browser will block it.

Resources

Make sure to go through these in detail, as well as the firmware readme file, as it is possible to brick the card.

Thanks to johndball.com; it seems his website is down, but please read his post on this card before you attempt a firmware update.



Tuesday, April 14, 2015

Exchange 2013: Database copy has been blocked from automatic activation on server by an administrative action. Reason: None specified.

After updating two Exchange 2013 mailbox servers in a DAG environment to CU8 and running Test-ReplicationHealth, Database Availability reported an error:
I tried to use -AutoSize to show the entire error, which wouldn't work, but after googling around I found this:

 $r = Test-ReplicationHealth  
 $r | ?{$_.Result.Value -ne 'Passed'} | fl  

The entire error is as follows: Database copy 'DB01' has been blocked from automatic activation on server 'EXCH2' by an administrative action. Reason: None specified..

It's actually an easy fix: go into the ECP, Servers -> Databases, suspend the healthy passive DB copy, wait a moment, then resume it.
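The same thing should work from the Exchange Management Shell as well, using the database and server names from the error above:

 Suspend-MailboxDatabaseCopy -Identity "DB01\EXCH2" -Confirm:$false  
 Resume-MailboxDatabaseCopy -Identity "DB01\EXCH2"  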
Now run Test-ReplicationHealth again and you should be good.

Thursday, March 12, 2015

Spotify Desktop (Windows): How to downgrade so that you can actually use Spotify

Spotify recently updated to version 1.0 and with it broke a lot of features that users have become accustomed to, most notably (for me) the client freezing up and causing songs to blip.

Download 0.9.14.13 here: http://www.filehorse.com/download-spotify/17905/download/
Link warning: it will automatically download.

Install it and you will have the older version, but we need to disable auto-update; sadly there is no option to do this in the preferences.  Follow these instructions to disable auto-update:

Browse to %appdata%/Spotify/

If there is a Spotify_new.exe application, delete it.

Create a new text document in the same folder:

Make sure extensions are showing and rename it to Spotify_new.exe

It will ask you to confirm changing the file type to .exe, which is correct; you don't want it to remain a text file.

Now set Spotify_new.exe to read-only: right click on it, click Properties, and check the Read-only attribute.
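If you prefer, the same steps can be done from a command prompt (a sketch; the del line will just complain harmlessly if Spotify_new.exe isn't there):

 cd /d %appdata%\Spotify  
 del Spotify_new.exe  
 type nul > Spotify_new.exe  
 attrib +r Spotify_new.exe  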

You should be all set.  It feels good to be back!

A lot can be said about the state of Spotify as a company at the moment.  This is release 1.0; it should be HUGE, it should be polished and beautiful and something the company, devs, and product managers can be proud of.  What happened?  Between the decisions that went into 1.0 and Spotify no longer offering some of my favorite albums lately, I am considering taking my subscription somewhere else.

Tuesday, February 10, 2015

Server 2012 R2: Disk # has been surprise removed, event ID 157 when backing up VHDs

I have a Server 2012 R2 Hyper-V virtual machine with 3 disks (VHDs) mounted: one for the C drive, one for logs, another for DB storage.  A VSS backup is taken every night, and during this time some curious event log entries are recorded:



The highlighted partmgr event 58 is described as: "The disk signature of disk 3 is equal to the disk signature of disk 0."  The other two partmgr events pair up virtual disks 1 and 2 the same way.  The disk event 157 is described as "Disk 3 has been surprise removed," and the other two disk events say the same for disks 4 and 5.

My theory is that when a VSS backup is taken, 3 new disks are created (3, 4, 5) that match the original virtual machine disks (0, 1, 2).  Data is then backed up to the new disks and a signature check is done (partmgr event 58) to make sure they match.  Disks 3, 4, and 5 are then ejected from the system, and event ID 157 is recorded.

If that is the case, these events should probably be classified as Informational rather than Warning when backups are taken, in order to keep the logs clean.