CentOS 7 kernel boot order bug


One of the best of times and worst of times for professional Linux users is when Red Hat releases a new major version of Enterprise Linux (RHEL/CentOS). We are on version 7.1, and a lot of the angst and angry discussion over new additions like systemd has started to die down as people accept “progress”. It is also a really nice time, because RHEL 7 is up-to-date enough to make a nice desktop workstation OS. I have moved most of the desktop and laptop machines I use over to CentOS 7 with the MATE desktop environment, and I am pretty happy with it.

However, it is not bug free yet. It usually takes until at least version “.2” before most of the annoying bugs are ironed out of a new Enterprise Linux release, which is part of the reason people are generally not in much of a hurry to update after a new major release. I have been butting heads with one particularly annoying bug that I hit frequently on installs, since I work with systems that need kernel modules recompiled for CUDA and the Xeon Phi.


The following is specifically related to CentOS Linux release 7.1.1503.

The bug I am referring to is the grub2-mkconfig kernel ordering bug:

Bug report
https://bugzilla.redhat.com/show_bug.cgi?id=1124074

From searching around the net I’ve found that a lot of folks have been blindsided by this, and there is often misinformation being given as advice about its cause and fix. (How could that happen on the internet? 🙂)

The basic problem is this: if you, or any package you are installing, run grub2-mkconfig to modify the grub2 config file, it will re-sort the kernel boot ordering incorrectly. This means that if you have just compiled a kernel module against an updated kernel source, you may find a different kernel running than you expect after a reboot, and thus a module that will not load.

You can do the following (as root) to see what ordering you get:

grub2-mkconfig  | grep menuentry
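
The grep above also prints any other line that happens to contain the word menuentry. As a minimal sketch, a small helper can number only the top-level entries, starting at 0, so the list is easier to compare with the kernel that is actually running. The function name list_menuentries is made up for this post, and the sketch assumes the titles are single-quoted the way grub2-mkconfig writes them on CentOS 7.

```shell
# list_menuentries (hypothetical helper): read grub2 config text on stdin
# and print each top-level menuentry title with its 0-based position.
list_menuentries() {
    awk -F"'" '/^menuentry /{ printf "%d  %s\n", n++, $2 }'
}

# Typical use (as root):
#   grub2-mkconfig 2>/dev/null | list_menuentries
#   uname -r    # the kernel that is actually running
```

The entry at position 0 is the one at the top of the boot menu.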


I have a workaround and a fix toward the end of the post.

CUDA

If you do a CUDA install, it will run the following script as part of the configuration. This is great, since it fixes the nonsense of having to blacklist the nouveau video module by hand. However, on CentOS 7 this currently triggers the sort order bug through the lines that call grub2-mkconfig or grubby.

ISGRUB1=""
 if [[ -f /boot/grub/grub.conf && ! -f /boot/grub2/grub.cfg ]] ; then
     ISGRUB1="--grub"
     GFXPAYLOAD="vga=normal"
 else
     echo "GRUB_GFXPAYLOAD_LINUX=text" >> /etc/default/grub
     grub2-mkconfig -o /boot/grub2/grub.cfg
 fi
 if [ -x /sbin/grubby ] ; then
   KERNELS=`/sbin/grubby --default-kernel`
   DIST=`rpm -E %{?dist}`
   ARCH=`uname -m`
   [ -z $KERNELS ] && KERNELS=`ls /boot/vmlinuz-*${DIST}.${ARCH}*`
   for kernel in ${KERNELS} ; do
     /sbin/grubby $ISGRUB1 \
       --update-kernel=${kernel} \
       --args="nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off $GFXPAYLOAD" \
        &>/dev/null
   done
 fi

Xeon Phi

Nothing about the Phi setup itself triggers the bug. However, when I do an MPSS install for the Phi driver, I rebuild the rpm for the kernel modules against an updated system kernel. This means that if you then do something like an NVIDIA driver or CUDA install, it will hit the kernel ordering bug and leave a system that doesn’t start the Phi drivers. If you don’t notice that this has happened, it can be puzzling to figure out. I always freeze kernel updates in yum when I do a Phi install to avoid having to rebuild the module rpm, but if the existing kernel order changes without your knowledge you will still end up with a Phi driver that doesn’t start.

#/etc/yum.conf
# The following line excludes kernel updates from yum update.
exclude=kernel*
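
Since the puzzling part is not noticing that the boot kernel changed, a quick check after a reboot can help. The following is only a sketch: same_kernel is a name made up here, and it assumes grubby reports the default kernel as a /boot/vmlinuz-<release> path, which is how the CUDA script above treats it.

```shell
# same_kernel DEFAULT_PATH RUNNING_RELEASE (hypothetical helper)
# Succeeds when the kernel path, with its /boot/vmlinuz- prefix removed,
# is exactly the given release string.
same_kernel() {
    [ "${1#/boot/vmlinuz-}" = "$2" ]
}

# Typical use after a reboot (as root):
#   same_kernel "$(/sbin/grubby --default-kernel)" "$(uname -r)" \
#       || echo "WARNING: running kernel is not the default kernel" >&2
```

The same comparison works with the release you built the module rpm against in place of the grubby output.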

UEFI boot is OK!

I was really getting confused when I was trying to figure out what was going on with this kernel order bug, because it seemed random. I have been doing UEFI installs on systems when I can, and this bug does not show up on a UEFI install! If you look in /boot on a system with a UEFI install, you will see a /boot/grub2/grub.cfg file that has the wrong sort order for the kernel menuentry lines and a /boot/efi/EFI/centos/grub.cfg file that has the correct order for the menuentry lines!
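
To see that difference without reading both files by eye, the menuentry titles can be pulled out of each file and compared. This is a sketch under the same assumption as before (single-quoted titles on lines that start with menuentry), and both function names are made up for illustration.

```shell
# menu_titles FILE (hypothetical helper): print the top-level menuentry
# titles of FILE, one per line, in the order they appear.
menu_titles() {
    awk -F"'" '/^menuentry /{ print $2 }' "$1"
}

# same_menu_order FILE1 FILE2 (hypothetical helper): succeed when both
# files list the same titles in the same order.
same_menu_order() {
    [ "$(menu_titles "$1")" = "$(menu_titles "$2")" ]
}

# Typical use (as root), passing the two config files mentioned above:
#   same_menu_order "$bios_cfg" "$uefi_cfg" || echo "menu order differs"
```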

Workaround

A simple workaround is to set the default kernel to the one you want and not worry about the sort order (knowing that you may need to adjust it again when you make system changes). For example, if your ordering is messed up so that the newest kernel is now the second menuentry, you can do the following. Note that the count starts from 0.

grub2-set-default 1
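
If you would rather not count entries by hand, the index can be looked up from the kernel version. This is a rough sketch: menu_index is a hypothetical helper, it matches the version as a plain substring of the menuentry title, and the version string in the usage comment is only an example.

```shell
# menu_index VERSION (hypothetical helper): read grub2 config text on stdin
# and print the 0-based index of the first top-level menuentry whose title
# contains VERSION. Exits non-zero when no entry matches.
menu_index() {
    awk -F"'" -v want="$1" '
        /^menuentry / {
            if (index($2, want)) { print n + 0; found = 1; exit }
            n++
        }
        END { if (!found) exit 1 }'
}

# Typical use (as root), reading the config file that is actually booted:
#   idx=$(menu_index "3.10.0-229.el7.x86_64" < /boot/grub2/grub.cfg) \
#       && grub2-set-default "$idx"
```

Pass the full version string, as printed by uname -r, so that a shorter prefix does not match an older entry by accident.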

“Fix”

You can fix the problem in the library script that grub2-mkconfig uses, by replacing the faulty comparison with sort commands that do the right thing. Look at the file /usr/share/grub/grub-mkconfig_lib: around line 264 you will find the function version_find_latest (). That’s where the problem is!

Nathan G. Grennan has posted a patch for this file. Thanks, Nathan!

https://bugzilla.redhat.com/attachment.cgi?id=966036&action=diff

You can generate a patch or just edit the file. An edited version of the version_find_latest () function looks like this:

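version_find_latest ()
{
# Original implementation, commented out:
#  version_find_latest_a=""
#  for i in "$@" ; do
#        if version_test_gt "$i" "$version_find_latest_a" ; then
#          version_find_latest_a="$i"
#        fi
#  done
#  echo "$version_find_latest_a"
  {
        # Non-rescue kernels: drop the arch suffix, version-sort newest
        # first, then put the suffix back.
        for i in "$@"; do
          echo "$i"
        done | grep -v rescue | sed 's/\.x86_64$//g' | sort -V -r | sed 's/$/.x86_64/g'
        # Rescue kernels are listed after all the regular ones.
        for i in "$@"; do
          echo "$i"
        done | grep rescue | sort -V
  } | head -n 1
}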
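
The edited function leans on sort -V (version sort, from GNU coreutils). The short sketch below only illustrates what that option does differently from a plain sort; it does not reproduce the logic of the original loop. The two version strings are hypothetical and were chosen only because they make the difference visible.

```shell
# Hypothetical version strings, chosen only to show the difference.
versions='3.10.0-99.el7
3.10.0-123.el7'

# A plain byte-wise sort compares "9" with "1" character by character,
# so it picks the wrong "latest" version.
printf '%s\n' "$versions" | LC_ALL=C sort -r | head -n 1    # 3.10.0-99.el7

# Version sort compares 99 with 123 as numbers.
printf '%s\n' "$versions" | sort -V -r | head -n 1          # 3.10.0-123.el7
```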

I hope this post saves some of you a headache trying to track down what is going on with this bug. I’m hopeful that version 7.2 of RHEL/CentOS will have this, and any other remaining annoying bugs, sorted out and all will be right with the world again.


Happy computing! –dbk