Grub bites man. Man stamps on grub. ‡

‡ — No invertebrates were harmed during the production of this article.

Almost exactly two weeks ago I got a (normal) pop-up message on my screen, letting me know that there were updates available from Ubuntu for my 21.10 version of Mate and that, by the way, 22.04 was now available if I wanted to upgrade.  I politely refused the upgrade offer, but accepted the updates.  Once the updates were completed I was prompted to reboot (and I noted that the initramfs had been updated).  This is all normal stuff for the millions of us running the various versions of Ubuntu, so after finishing up what I was working on (and washing my hands) I hit the reboot button.

It didn’t.

It hasn’t for the last two weeks (not without a recovery image or an external disk plugged in, anyway).

At first I wasn’t too concerned.  This (as the ‘mericans say) isn’t my first rodeo and, if everything else failed, I’m using ZFS, so I just had to roll back to the last good snapshot, right?

Wrong!

As I wasted more and more time futzing around with things, I just got more and more confused. The EFI boot was working and grub was being run, but it insisted that it couldn’t find a workable filesystem.  Booting into a live-CD image said there were two perfectly good ZFS pools on the disk (actually a 512GB NVME SSD) and a “zpool scrub” passed with flying colours on both of them (bpool and rpool).  Running grub-probe always came back with a message along the lines of “Nope, ain’t nuthin there that I recognize, pinhead!”.

By this time I had convinced myself that someone had broken into Ubuntu’s update servers and slipped a Trojan into my disk headers, or maybe someone was slipping magic mushrooms into my breakfast tea, or aliens had abducted my real ZFS pools and replaced them with zombies.  Anyway, I needed to get some work done, so I found an unused external USB disk, fired up a live-CD again and, because this is ZFS, did a ZFS send-receive between the original disk and the external.  When it came time to run the grub install (on the external disk) it once again insisted that there was nothing at all there which it recognised as any sort of valid filesystem.  Duh!

Okay, nuts to this.  I put a 22.04 live-CD image in, booted it up (“Oh look, you have two perfectly good ZFS pools on this external disk!  Are you really sure you want me to scribble all over them?”) and installed a brand spanking new version of 22.04 onto the external, followed by a ZFS send-receive of just the USERDATA filesystem from the original NVME drive (I’d like to say that I rebooted and everything sprang into life on the external disk, but of course it didn’t.  You can’t easily remove an NVME “disk” from a laptop and of course EFI and grub just kept right on trying to access it).

Eventually, I did manage to broker a truce between the laptop BIOS, EFI and grub, to the point that I could reboot the system successfully, as long as I was actually there to hit the F9 key and manually select the EFI entry for the external disk  …otherwise it would stubbornly continue with a blinky dance of reset, spin disk up, flash something on the screen for 10ms, spin the disk down and repeat.  I never did manage to read what that 10ms micro message was (something from those dang aliens again).

But glory hallelujah!!  It was fantastic!  Not only did I have Mate back again, but the stupidly useless ELAN touchpad on the HP S15 laptop actually worked properly for the first time ever!!  Banzai!!  Well done 22.04!  I’m sorry that I refused your offer of an upgrade at the start of all of this.

Well, long story longer, that’s the way that things have stayed for the past two weeks.  Not that there hasn’t been great gnashing of teeth (not recommended at my advanced age) and thrashing of Gewgull searches (and even Ubuntu’s bug database) for any hints of what could be causing my problems. Searches including both “grub” and “aliens” turned up some interesting stuff, but it seems that I’m the only semi-sentient being-man-thingy in the universe who is having problems with those things in connection with computers.

Eventually, the third neuron in the second bank started firing occasionally and I began to have flashbacks of the horrible problems I had a while back, trying to do ZFS back-ups between my Linux laptop and the mirrored disks on my FreeBSD servers.  Although the FreeBSD servers could send data to a receiving Linux system, the other way around just would not work at all.  It turned out that there were incompatibilities between the ZFS properties used on the two systems, even though the ZFS versions were meant to be the same (upgrading the FreeBSD servers to version 13.1 seemed to cure that issue).  Now it seemed probable that there was some similar incompatibility between grub and ZFS  …and, finally, it clicked.

One of those property updates which hit FreeBSD a while back was the introduction of “zstd” compression (Ubuntu introduced it a short while later).  I had tried changing the compression to “zstd” on FreeBSD and noticed a worthwhile increase in compression-ratio with no apparent difference in speed, so I’d made the changes on my laptop, too.  The back-ups kept on working and I was saving more disk space …what’s not to like?

Double-duh!  Pinhead at work (stand clear!).

Okay, so I have (I suspect, anyway) a philosophical disconnect between grub and ZFS about what constitutes a filesystem.  In my book, ZFS wins (whatever grub says, I’m an ex-Sun guy and old loyalties die hard).  What to do now?  Well, back on Gewgull I replace “aliens” with “alternative” and get a few hits on something called “ZFSBootMenu”.  I go to their GitHub repository and start reading.  Nope, I must have left “aliens” in there somewhere, ‘coz I can’t understand this stuff.  It starts off more or less okay (I understand their reference to FreeBSD’s boot environments …a very handy feature, sign me up), but then they start wandering off into the weeds with things like “fzf” and “dracut” (which sounds like a bad day at the vasectomy clinic) and I want to stop reading so badly that my teeth hurt (told you not to gnash).  I briefly leave the README page and head over to the pre-built releases instead.

Ah, this is interesting.  “ZFSBootMenu v2.0.0  …New features – Dracut is now optional;”.  Phew!  Well that’s a relief.  Unfortunately, there are four different files (and additional source tarballs), but no information on what the differences are.  A couple have “release” in their names and the others have “recovery”, but no actual information in the v2.0.0 section about what the differences are or how to use them.  Two are compressed tar files (okay, I know what those are, anyway) but the two others have a “.EFI” suffix, which I haven’t come across before, but it’s not too much of a stretch to suspect that they should be used in the Extensible Firmware Interface (that’s the BIOS to you, sonny!).  But how?

Back to the README:-

“Each release includes pre-generated images (both a monolithic UEFI applications as well as separate kernel and initramfs components suitable for both UEFI and BIOS systems) based on Void Linux. Building a custom image is known to work in the following configurations…”

Well there you go then.  C’mon!  Chop-chop!  Get on with it!

I won’t bore you with the rest of it, but despite the rambling (or partly missing) documentation, it turns out that ZFSBootMenu is really quite a nifty bootloader (grub alternative) specifically for ZFS filesystems.  It does have a couple of quirks, but after a few false starts, I did manage to bring my 21.10 NVME SSD back from zombieville and ended up with a laptop which I could once again disconnect from all external cables and disks and still have it boot reliably.  To do ZFSBootMenu full justice I need to put together another post with a step-by-step guide of how to recover a stuffed ZFS machine, but just in case you’re stuck up this same creek without a paddle, here are just a couple of essential hints to getting ZFSBootMenu to work:-

  • Use the .EFI  file (I used the “release”, but “recovery” should also work okay).
  • Remember: when you’re booting using ZFSBootMenu, it assumes that the kernel and initramfs are on the root filesystem, so you’re going to be booting from “rpool”, not “bpool” (and so your kernel and initramfs need to be present on “rpool”, which is not standard in Ubuntu).
  • Disconnect any other disks which you may have added to the system in previous attempts to fix this boot problem (ie:- external USB disks with copies of your original bpool/rpool filesystems).  They will just confuse EFI, grub and you.
  • Set “canmount=noauto” on all filesystems which have “mountpoint=/” set (this one is really important).
  • Make sure that your ZFS pools can be imported without having to use “import -f” (use a live-CD or ZFSBootMenu itself to import the pools manually, fix any problems and then export them again, before you reboot).
  • Lastly, for those others of you out there with non-US keyboard layouts, the “Alt” key for command mode selection within ZFSBootMenu will probably not be the “Alt” key.  For my Japanese keyboard, the magic key turned out to be the one key which I had never before pressed on my keyboard …the (shudder!)  “Windows” key.
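
The “canmount” hint can be roughed out in shell.  This is a sketch under stated assumptions rather than anything taken from the ZFSBootMenu documentation: the function name suggest_canmount_fixes is invented for illustration, the pool name rpool follows the Ubuntu layout described above, and the function only prints the “zfs set” commands it would suggest (it changes nothing), so they can be checked by eye first.

```shell
#!/bin/sh
# Hypothetical helper (the name is invented for this sketch).
# Reads "name<TAB>mountpoint<TAB>canmount" lines on stdin and prints a
# "zfs set" command for each dataset mounted at "/" that is not yet noauto.
# It prints the commands only; nothing is changed.
suggest_canmount_fixes() {
    awk -F'\t' '$2 == "/" && $3 != "noauto" {
        print "zfs set canmount=noauto " $1
    }'
}

# Typical use from a live-CD shell, as root, with the pool already imported:
#   zfs list -H -t filesystem -o name,mountpoint,canmount | suggest_canmount_fixes
# If the pool was imported with an altroot (-R), the mountpoint column will
# likely show that prefix instead of a bare "/", so adjust the test to suit.
#
# For the "import without -f" hint, a clean export before rebooting is the
# usual way to leave a pool importable (pool name assumed to be rpool):
#   zpool export rpool
```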

And the bottom line… Just don’t change your bpool to zstd compression.  Okay?!?    †  ∇
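
To see whether a pool has already been switched over, the compression property can be listed and filtered.  This is a minimal sketch, assuming the Ubuntu pool name bpool; the function name find_zstd_datasets is invented here.  It only reports the current property setting and says nothing about blocks which have already been written.

```shell
#!/bin/sh
# Hypothetical helper (the name is invented for this sketch).
# Reads "name<TAB>value" lines on stdin, in the format produced by
# "zfs get -H -o name,value compression", and prints the names of the
# datasets whose compression setting is any zstd variant.
find_zstd_datasets() {
    awk -F'\t' '$2 ~ /^zstd/ { print $1 }'
}

# Typical use (pool name assumed to be bpool):
#   zfs get -r -H -o name,value compression bpool | find_zstd_datasets
```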


†  —  From a cursory check of the Grub2 code in the upstream Debian repository, it looks as though the zstd compression library was added in November of 2018.  Unfortunately it seems to have been implemented only for BTRFS, not for ZFS.

  —  “cursory”  Adjective.  Causes the reader to leap out of their chair and shout “G’dammit!!”  (this answer is sure to get you an “A” in GCSE English).

∇  –  Grub Compatibility  —  As usual, the folks at OpenZFS/Debian/Ubuntu are way ahead of me.  If you need to create a new pool which will work with Grub2, there is already an existing compatibility file (they’re stored in /usr/share/zfs/compatibility.d).  You can use it like this:-  

   zpool create -o compatibility=grub2  <POOL NAME> <VDEV>
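
For an existing pool, one rough check is whether a given feature is named in that compatibility file at all.  The sketch below assumes the file format described in the OpenZFS documentation (feature names separated by whitespace or commas, with “#” starting a comment) and assumes that the zstd feature goes by the name zstd_compress; the function name compat_lists is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (the name is invented for this sketch).
# Usage: compat_lists FILE FEATURE
# Succeeds (exit status 0) if FEATURE is named in the compatibility FILE.
# Comments are stripped first, then commas, spaces and tabs are turned into
# newlines so that each feature name sits on a line of its own.
compat_lists() {
    sed 's/#.*//' "$1" | tr ', \t' '\n\n\n' | grep -qxF -- "$2"
}

# Typical use (the feature name zstd_compress is an assumption):
#   compat_lists /usr/share/zfs/compatibility.d/grub2 zstd_compress \
#       || echo "zstd is not on the grub2 list"
```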

If anyone is desperate enough to have read all the way down here to the bottom, I’ll just note that this problem hasn’t been submitted as an actual bug, because the auto-submission procedure won’t let you submit a bug against a program which isn’t installed on your system and, of course, I’d removed the grub package and used ZFSBootMenu to actually get my laptop working again. However, the secondary suggested method of opening the issue as a question in LaunchPad is available here if you’re interested (and I’d like to thank Manfred Hampl for trying to help me out there).

Update — 27th July 2022 – I have submitted bug request #1982897 with a condensed description of this issue and a request for an upstream addition of “zstd” support for ZFS in the Grub2 code.

Update #2 — 27th July 2022 – Submitted bug #62821 as a feature request for the addition of “zstd” support to the Grub2 ZFS code.

Update #3 — 13th August 2022 – Updated the Grub2 bug report with a little more detail on the severity of the problem, as there hasn’t been any acknowledgement of the original submission as yet.

ESP12 — Bad batch out there

Just in case you haven’t seen it yet, GeekEnArgentina has reported on a bad batch of ESP12 boards which he received.  The symptoms were an endless boot loop when trying to program the device, with the problem being temporarily (and mysteriously) suppressed when a finger was touched to the upper, L/H corner of the ESP’s RF shield.

After two days of hectic troubleshooting (GeekEnArgentina was trying to ship out an order of 50 units to a customer), he eventually isolated the problem to a capacitor in the antenna circuit (which is assumed to be the wrong value, similar to the !470Ω LED resistor issue from a couple of years back).

His descriptions of the issue and the troubleshooting process are well worth a read.

 

PWM problems identified

People have been reporting problems with the PWM function on the ESP8266 causing resets for months, with others claiming rock-solid reliability for virtually identical applications.  There’s been a lot of back and forth on the ESP8266.com site regarding whether this was a real issue, or just a problem with layout/decoupling or perhaps just inadequate power supplies.

Unfortunately, the SDK API Guide from Espressif hasn’t helped matters too much with, as Pete Scargill pointed out, things like the “duty” parameter being documented initially as an 8-bit integer (page 178, “pwm_init()”) and then later as 32-bit (page 179, “pwm_set_duty()”), guaranteed to trip up the unwary (ie:- me!).

Finally, user “anszom” seems to have identified the issue causing the real problem, which is a 32-bit integer containing the address of an interrupt handler routine.  Apparently the address is not tagged as being 4-byte aligned (which it needs to be), which means it’s a matter of luck as to whether the value is aligned, or not, when your application is compiled.

Thankfully, “anszom” also provides a manual fix for the linker script to work around this problem until Espressif provide an update to the SDK (and, hopefully, the documentation).
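
Whether a particular build got lucky can be checked by looking at where the symbol in question landed.  The following is only a sketch under assumptions: it is not “anszom”’s fix, the function name check_align4 is invented, and the symbol and file names in the usage comment are placeholders, since the actual symbol is not named here.  It reads “nm”-style output and tests the low two bits of the address.

```shell
#!/bin/sh
# Hypothetical helper (the name is invented for this sketch).
# Usage: check_align4 SYMBOL
# Reads nm-style lines ("ADDRESS TYPE NAME") on stdin and reports whether
# the address of SYMBOL is a multiple of four (low two bits both zero).
check_align4() {
    while read -r addr kind name; do
        [ "$name" = "$1" ] || continue
        if [ $(( 0x$addr & 3 )) -eq 0 ]; then
            echo "$name: 4-byte aligned"
        else
            echo "$name: NOT 4-byte aligned"
        fi
    done
}

# Typical use (file and symbol names are placeholders, and the toolchain
# prefix is assumed to be the usual ESP8266 one):
#   xtensa-lx106-elf-nm firmware.elf | check_align4 some_handler_symbol
```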

ESP-06 Module Warning

Just a quick note here to raise a flag concerning module compatibility, functionality and design.

[Image: ESP-06 module, bottom view]

I would guess that most people reading this probably know already that the company Espressif is the manufacturer of the ESP8266 chip and the producer of a couple of the reference designs to support it.  Most (though not all) of the actual modules (the PCBs with the ESP8266 chip and additional components required to support it) are produced by other companies.  The most prolific producer is “AI Thinker” and we (the hobbyist community) have a lot to thank them for: the low retail cost of the modules for one thing and the wide variety of available designs for another.  Equally so, there have been some fairly big bumps along the road, with modules getting into the supply chain with magic smoke issues (the “self healing” LED current limiting resistor problem) and confusion from both the vendors and recipients of various modules when the item received was markedly different from the item ordered (to be specific, I’m talking about the ESP-12 variants here, not odd vendor mis-ships).

The lack of useful documentation has been one of the main bugbears with the ESP8266 from the very start and AI Thinker have been a fairly major player in that particular field, even if you do happen to read Chinese.

Well, unfortunately it looks rather as though it’s happening again.  Danny Von Lintzgy has left a couple of comments over at Pete Scargill’s blog noting that his product is pretty much dead in the water for the time being, as the latest batch of ESP-06 modules has different connections from all of those which he had received previously (the ESP-12 all over again).  Danny says “It seems the new variant has three of the corner pads connected to pins, where previously all four corner pads were connected to ground”.  That’s a pretty major change to have happen without any notification or hints (even the ESP-12E at least had visible, extra pads).

So, if you’re using ESP-06 modules, please take note of Danny’s warning and be very careful with any new modules you purchase.  It doesn’t sound like a magic smoke issue, but if your new ESP-06 won’t boot and won’t program, this could be the reason.

Update – Danny has reported that AI Thinker support got back to him with the news that there are currently three versions of the ESP-06 out there.  The first is the older original, then (most worrying) there’s a not-quite-right version of the new design (which apparently has a problem with one of the resistors — but no news yet as to what sort of issues, if any, this errant resistor causes) and the last is the new design with the resistor problem corrected.  Apparently, the original and the newer versions have a slightly different AI Thinker logo on the outside of the RF shielding with (I believe) the newer versions having WiFi-style radio waves radiating from the head of the character which makes up the stylized “I” of the “AI”.