
Issue with /srv volume


  • #1

    Hi,

    I'm trying to use PMM started from the AMI, but I keep encountering issues with it.

    We started with an m4.2xlarge instance and added 50 MySQL instances (linux:metrics, mysql:metrics, mysql:queries via slow log) plus 4 ProxySQL instances for monitoring. Note: I added LimitNOFILE=65536 to the Prometheus service to get rid of the "Too many open files" error.
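For reference, this is roughly how the file-limit change can be applied via a systemd drop-in rather than editing the unit file directly; a sketch, assuming the unit on the PMM AMI is named prometheus.service:

```shell
# Create a systemd override directory for the Prometheus service
# (assumes the unit name is prometheus.service).
sudo mkdir -p /etc/systemd/system/prometheus.service.d
sudo tee /etc/systemd/system/prometheus.service.d/limits.conf <<'EOF'
[Service]
# Raise the open-file limit; the systemd directive is LimitNOFILE.
LimitNOFILE=65536
EOF
sudo systemctl daemon-reload
sudo systemctl restart prometheus
```

A drop-in survives package upgrades that would overwrite the shipped unit file.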

    The issue is: on the second day I noticed the PMM web UI became unresponsive. It appears the load average is too high:

    # uptime
    14:01:33 up 1 day, 5:23, 1 user, load average: 24.00, 23.94, 23.39
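Since the CPUs were idle while the load average sat around 24, the load was likely driven by processes blocked on disk I/O rather than by CPU work. A quick way to check this (generic Linux commands, not from the original post):

```shell
# An m4.2xlarge has 8 vCPUs, so a load average of ~24 with idle CPUs
# usually means tasks stuck in uninterruptible I/O wait (state D),
# which fits the XFS write errors seen in dmesg.
cpus=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
echo "load=${load} cpus=${cpus}"
# List any processes currently in uninterruptible sleep:
ps -eo state,pid,comm | awk '$1 == "D"'
```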

    However, atop shows the CPUs are idle; see the attached screenshot.


    dmesg shows:
    [105465.363093] XFS (dm-4): metadata I/O error: block 0x23776f0 ("xfs_buf_iodone_callback_error") error 5 numblks 8
    [105466.633074] XFS: Failing async write: 2984 callbacks suppressed
    [105466.635627] XFS (dm-4): Failing async write on buffer block 0x23776f0. Retrying async write.


    I see the disks are not full:

    # df -hT
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/xvda1 xfs 128G 2.7G 126G 3% /
    devtmpfs devtmpfs 16G 0 16G 0% /dev
    tmpfs tmpfs 16G 0 16G 0% /dev/shm
    tmpfs tmpfs 16G 649M 15G 5% /run
    tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
    /dev/mapper/DataVG-DataLV xfs 205G 28G 178G 14% /srv
    tmpfs tmpfs 3.2G 0 3.2G 0% /run/user/0
    tmpfs tmpfs 3.2G 0 3.2G 0% /run/user/1001
    # df -h -i
    Filesystem Inodes IUsed IFree IUse% Mounted on
    /dev/xvda1 128M 55K 128M 1% /
    devtmpfs 4.0M 338 4.0M 1% /dev
    tmpfs 4.0M 1 4.0M 1% /dev/shm
    tmpfs 4.0M 397 4.0M 1% /run
    tmpfs 4.0M 16 4.0M 1% /sys/fs/cgroup
    /dev/mapper/DataVG-DataLV 205M 525K 205M 1% /srv
    tmpfs 4.0M 1 4.0M 1% /run/user/0



    I tried to reboot the server, but it got stuck. My admins rebooted it via the AWS Console, but after the reboot the LVM volume with /srv had disappeared:

    # df -hT
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/xvda1 xfs 128G 2.9G 126G 3% /
    devtmpfs devtmpfs 16G 0 16G 0% /dev
    tmpfs tmpfs 16G 0 16G 0% /dev/shm
    tmpfs tmpfs 16G 17M 16G 1% /run
    tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
    tmpfs tmpfs 3.2G 0 3.2G 0% /run/user/1001
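When /srv disappears like this, it is worth checking whether the LVM volume is merely deactivated rather than actually lost. A rough recovery sketch, run as root; DataVG/DataLV are the names from the earlier df output:

```shell
# Is the underlying EBS block device still attached?
lsblk
# Does LVM still see the physical volume and the logical volume?
sudo pvscan
sudo lvscan          # look for DataVG/DataLV and whether it is ACTIVE

# If the LV exists but is inactive, activate the VG and remount:
sudo vgchange -ay DataVG
sudo mount /dev/mapper/DataVG-DataLV /srv
```

If lvscan shows nothing at all, the problem is at the block-device or LVM-metadata level, not just activation.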


    Attached Files
    Last edited by DmytroKh; 01-18-2018, 02:38 AM.

  • #2
    I must say this is the second time we've encountered this issue with the /srv volume disappearing after a reboot under high load average (I wasn't on it the first time, so I didn't collect any data to post). We also lost /srv when moving from a t2 to an m4 instance, and when moving to another VPC, via the image clone feature.

    My admins decided to launch a new instance with a single volume:
    # df -hT
    Filesystem Type Size Used Avail Use% Mounted on
    /dev/xvda1 xfs 512G 25G 488G 5% /
    devtmpfs devtmpfs 16G 0 16G 0% /dev
    tmpfs tmpfs 16G 0 16G 0% /dev/shm
    tmpfs tmpfs 16G 137M 16G 1% /run
    tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
    tmpfs tmpfs 3.2G 0 3.2G 0% /run/user/1001

    but something went wrong; at least the MySQL setup was incomplete: the MySQL root password wasn't set, so I found a temporary one in /var/log/mysql.log and set the root password to the one from /root/.my.cnf. Next, I found there is no orchestrator database or user, no user 'percona'@'localhost', and so on.



    • #3
      This has happened to my PMM instance on EC2 a few times now as well. Something seems pretty wrong when the instance can only run for about a week or two before it gets into this broken state.



      • #4
        I had the same issue with two PMM instances using the Marketplace AMI. After approximately a week to 10 days, each PMM instance would become unresponsive. I found that the /srv volume is thin-provisioned in LVM and very quickly runs out of metadata space, which effectively makes the /srv volume unwritable. You can check metadata space usage with lvs -a.
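To make the thin-pool check above concrete, the lvs columns Data% and Meta% show pool usage, and lvextend can grow the metadata area. This is a sketch: ThinPool is a placeholder name, so substitute whatever lvs -a reports for your volume group:

```shell
# Show data and metadata usage for all LVs, including hidden thin-pool
# internals such as [ThinPool_tmeta]; watch the Meta% column.
sudo lvs -a -o lv_name,vg_name,attr,data_percent,metadata_percent

# If metadata is nearly full, grow it (requires free extents in the VG):
sudo lvextend --poolmetadatasize +128M DataVG/ThinPool
```

Once the metadata space fills completely, the pool goes read-only and every write to /srv fails, which matches the XFS I/O errors earlier in this thread.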

