4/24/2014

glusterd fails to start after a restart: getting it back up for now

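After the OpenSSL vulnerability was announced, I patched the web servers, mail servers, and other hosts with SSH exposed, restarting services after each patch. After a while that got tedious, so on some hosts I rebooted the whole OS instead.
After one of those reboots, glusterd would no longer come up, so here are my notes. For the record, the problem has nothing to do with the OpenSSL patch.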

●Log

[[email protected] ~]$ sudo less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

[2014-04-11 01:04:49.486482] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.2 (/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2014-04-11 01:04:49.494405] I [glusterd.c:961:init] 0-management: Using /var/lib/glusterd as working directory
[2014-04-11 01:04:49.497892] I [socket.c:3480:socket_init] 0-socket.management: SSL support is NOT enabled
[2014-04-11 01:04:49.497918] I [socket.c:3495:socket_init] 0-socket.management: using system polling thread
[2014-04-11 01:04:49.498045] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory
[2014-04-11 01:04:49.498071] W [rpc-transport.c:257:rpc_transport_load] 0-rpc-transport: volume 'rdma.management': transport-type 'rdma' is not valid or not found on this machine
[2014-04-11 01:04:49.498084] W [rpcsvc.c:1389:rpcsvc_transport_create] 0-rpc-service: cannot create listener, initing the transport failed
[2014-04-11 01:04:51.021829] I [glusterd-store.c:1339:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 2
[2014-04-11 01:04:51.029440] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
[2014-04-11 01:04:51.029475] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
[2014-04-11 01:04:51.594617] I [glusterd-handler.c:2818:glusterd_friend_add] 0-management: connect returned 0
[2014-04-11 01:04:51.594755] I [rpc-clnt.c:962:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-04-11 01:04:51.594834] I [socket.c:3480:socket_init] 0-management: SSL support is NOT enabled
[2014-04-11 01:04:51.594851] I [socket.c:3495:socket_init] 0-management: using system polling thread
[2014-04-11 01:04:51.603866] I [glusterd.c:125:glusterd_uuid_init] 0-management: retrieved UUID: 553eb388-83e6-43bd-86b5-0b70cd1ac716
[2014-04-11 01:04:51.607534] E [glusterd-store.c:2487:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2014-04-11 01:04:51.607567] E [xlator.c:390:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2014-04-11 01:04:51.607588] E [graph.c:292:glusterfs_graph_init] 0-management: initializing translator failed
[2014-04-11 01:04:51.607601] E [graph.c:479:glusterfs_graph_activate] 0-graph: init failed
[2014-04-11 01:04:51.607768] W [glusterfsd.c:1002:cleanup_and_exit] (-->/usr/sbin/glusterd(main+0x5d2) [0x406802] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xb7) [0x4051b7] (-->/usr/sbin/glusterd(glusterfs_process_volfp+0x103) [0x4050c3]))) 0-: received signum (0), shutting down
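
Most of the log above is informational, and only the E-level lines matter for diagnosis. A minimal sketch of a filter, assuming every line starts with a bracketed timestamp followed by a one-letter level as in the excerpt above. The function name gluster_log_errors is hypothetical, not part of GlusterFS:

```shell
#!/bin/sh
# Hypothetical helper: print only the error-level (E) lines of a glusterd log.
# Assumes the "[timestamp] LEVEL [source] ..." layout seen in the excerpt above.
gluster_log_errors() {
    # Reads the files given as arguments, or stdin when there are none.
    cat "$@" | grep -E '^\[[^]]+\] E \['
}

# Usage (log path taken from the session above):
#   sudo cat /var/log/glusterfs/etc-glusterfs-glusterd.vol.log | gluster_log_errors
```

Changing the `E` in the pattern to `[EW]` would include warnings as well.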

●First, deal with [2014-04-11 01:04:49.498045] E [rpc-transport.c:253:rpc_transport_load] 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so: cannot open shared object file: No such file or directory

This error had probably been showing up since before the current problem. I don't use RDMA, so I removed it from the config.

[[email protected] ~]$ diff -Nru /etc/glusterfs/glusterd.vol{.back,}
--- /etc/glusterfs/glusterd.vol.back 2014-01-03 12:38:22.000000000 +0000
+++ /etc/glusterfs/glusterd.vol 2014-04-11 01:07:15.819054736 +0000
@@ -1,7 +1,7 @@
 volume management
     type mgmt/glusterd
     option working-directory /var/lib/glusterd
-    option transport-type socket,rdma
+    option transport-type socket
     option transport.socket.keepalive-time 10
     option transport.socket.keepalive-interval 2
     option transport.socket.read-fail-log off
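
The same edit can be done without opening an editor. A sketch, assuming the option line reads exactly `option transport-type socket,rdma` as in the diff above. The function name drop_rdma_transport is hypothetical, and it only filters stdin to stdout, so the real file is never touched directly:

```shell
#!/bin/sh
# Hypothetical helper: rewrite "transport-type socket,rdma" to "transport-type socket".
# Every other line passes through unchanged.
drop_rdma_transport() {
    sed -E 's/^([[:space:]]*option[[:space:]]+transport-type[[:space:]]+)socket,rdma[[:space:]]*$/\1socket/'
}

# Usage: write the result to a scratch file and review it before installing it.
#   drop_rdma_transport < /etc/glusterfs/glusterd.vol.back > /tmp/glusterd.vol.new
#   diff -u /etc/glusterfs/glusterd.vol.back /tmp/glusterd.vol.new
```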

Then I tried starting it again.
[[email protected] ~]$ sudo /etc/init.d/glusterd start
                                         [FAILED]

Hmm. The log shows that the rdma error is gone, but...

[2014-04-11 01:08:05.016191] I [glusterfsd.c:1910:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 3.4.2 (/usr/sbin/glusterd --pid-file=/var/run/glusterd.pid)
[2014-04-11 01:08:05.024142] I [glusterd.c:961:init] 0-management: Using /var/lib/glusterd as working directory
[2014-04-11 01:08:05.027698] I [socket.c:3480:socket_init] 0-socket.management: SSL support is NOT enabled
[2014-04-11 01:08:05.027726] I [socket.c:3495:socket_init] 0-socket.management: using system polling thread
[2014-04-11 01:08:06.544738] I [glusterd-store.c:1339:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 2
[2014-04-11 01:08:06.548780] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
[2014-04-11 01:08:06.548823] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1
[2014-04-11 01:08:07.119147] I [glusterd-handler.c:2818:glusterd_friend_add] 0-management: connect returned 0
[2014-04-11 01:08:07.119290] I [rpc-clnt.c:962:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
[2014-04-11 01:08:07.119375] I [socket.c:3480:socket_init] 0-management: SSL support is NOT enabled
[2014-04-11 01:08:07.119391] I [socket.c:3495:socket_init] 0-management: using system polling thread
[2014-04-11 01:08:07.128143] I [glusterd.c:125:glusterd_uuid_init] 0-management: retrieved UUID: 553eb388-83e6-43bd-86b5-0b70cd1ac716
[2014-04-11 01:08:07.130647] E [glusterd-store.c:2487:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
[2014-04-11 01:08:07.130679] E [xlator.c:390:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2014-04-11 01:08:07.130695] E [graph.c:292:glusterfs_graph_init] 0-management: initializing translator failed
[2014-04-11 01:08:07.130706] E [graph.c:479:glusterfs_graph_activate] 0-graph: init failed
[2014-04-11 01:08:07.130880] W [glusterfsd.c:1002:cleanup_and_exit] (-->/usr/sbin/glusterd(main+0x5d2) [0x406802] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xb7) [0x4051b7] (-->/usr/sbin/glusterd(glusterfs_process_volfp+0x103) [0x4050c3]))) 0-: received signum (0), shutting down


●What is this?
[2014-04-11 01:08:06.548780] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-0
[2014-04-11 01:08:06.548823] E [glusterd-store.c:1858:glusterd_store_retrieve_volume] 0-: Unknown key: brick-1

The mailing list archives turn up several hits, but some look off the mark and others were left unresolved.
http://www.gluster.org/pipermail/gluster-users/2013-August/036984.html
http://www.gluster.org/pipermail/gluster-users/2013-January/035249.html
http://gluster.org/pipermail/gluster-users/2012-August/011189.html
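
One thing that could be checked before rebuilding anything is whether the brick hostnames recorded by glusterd still resolve. The session here did not verify this, so treat it as a guess rather than the cause. The sketch assumes glusterd keeps lines of the form `brick-N=HOST:PATH` in a per-volume info file under /var/lib/glusterd/vols/. The function name brick_hosts is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: extract the host part of "brick-N=HOST:..." lines.
# Assumption: the per-volume info file uses this key=value layout.
brick_hosts() {
    sed -n 's/^brick-[0-9][0-9]*=\([^:]*\):.*$/\1/p'
}

# Usage (the info file path is an assumption; getent is the standard resolver lookup):
#   sudo cat /var/lib/glusterd/vols/vol01/info | brick_hosts | while read -r h; do
#       getent hosts "$h" >/dev/null || echo "unresolved: $h"
#   done
```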


I took a snapshot on AWS so the root cause can be investigated later, and for now I'm prioritizing recovery.

●Recreate the volume
[[email protected] ~]$ sudo cp -rp /var/lib/glusterd{,.back}
[[email protected] ~]$ sudo diff -Nru /var/lib/glusterd{.back,}
[[email protected] ~]$ sudo gluster peer probe ec2-****

$ sudo gluster volume heal vol01 info
→ No good
$ sudo /etc/init.d/nginx stop
$ sudo umount /var/www
$ sudo /etc/init.d/glusterfsd stop
$ sudo gluster volume stop vol01
$ sudo gluster volume delete vol01
$ sudo gluster volume create vol01 replica 2 transport tcp ec2-54-238-57-107.ap-northeast-1.compute.amazonaws.com:/data/brick01 ec2-176-34-56-112.ap-northeast-1.compute.amazonaws.com:/data/brick01
volume create: vol01: failed: /data/brick01 or a prefix of it is already part of a volume
→ It complained

$ sudo rm -rf /data/brick01/.glusterfs
$ sudo setfattr -x trusted.glusterfs.volume-id /data/brick01
$ sudo setfattr -x trusted.gfid /data/brick01

[[email protected] ~]$ sudo rm -rf /data/brick01/.glusterfs
[[email protected] ~]$ sudo setfattr -x trusted.glusterfs.volume-id /data/brick01
[[email protected] ~]$ sudo setfattr -x trusted.gfid /data/brick01
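
These three steps have to be repeated on every brick host, so it helps to generate them from the brick path. The sketch below is a dry run that only prints the commands and executes nothing. It assumes the metadata directory is `.glusterfs` directly under the brick root, and the function name brick_reset_commands is hypothetical:

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the commands that clear GlusterFS
# metadata from a brick directory so it can be reused in a new volume.
# Nothing is executed here; review the output before running it with sudo.
brick_reset_commands() {
    brick=${1:-}
    brick=${brick%/}    # drop one trailing slash, if any
    if [ -z "$brick" ]; then
        echo "usage: brick_reset_commands BRICK_PATH" >&2
        return 1
    fi
    printf 'setfattr -x trusted.glusterfs.volume-id %s\n' "$brick"
    printf 'setfattr -x trusted.gfid %s\n' "$brick"
    printf 'rm -rf %s/.glusterfs\n' "$brick"
}

# Usage:
#   brick_reset_commands /data/brick01
```

An empty path or a bare `/` is refused, so the output can never include an `rm -rf` aimed at the filesystem root.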

$ sudo gluster volume create vol01 replica 2 transport tcp ec2-54-238-57-107.ap-northeast-1.compute.amazonaws.com:/data/brick01 ec2-176-34-56-112.ap-northeast-1.compute.amazonaws.com:/data/brick01
volume create: vol01: success: please start the volume to access data
→ Success

[[email protected] ~]$ sudo gluster volume start vol01
volume start: vol01: success
[[email protected] ~]$ sudo gluster volume status
Status of volume: vol01
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick ec2-54-238-57-107.ap-northeast-1.compute.amazonaw
s.com:/data/brick01                                     49152   Y       3379
Brick ec2-176-34-56-112.ap-northeast-1.compute.amazonaw
s.com:/data/brick01                                     49153   Y       3370
NFS Server on localhost                                 2049    Y       3387
Self-heal Daemon on localhost                           N/A     Y       3382
NFS Server on ec2-54-238-57-107.ap-northeast-1.compute.
amazonaws.com                                           2049    Y       3395
Self-heal Daemon on ec2-54-238-57-107.ap-northeast-1.co
mpute.amazonaws.com                                     N/A     Y       3391

There are no active volume tasks
[[email protected] ~]$ sudo /etc/init.d/glusterfsd start
$ df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/xvda1       8256952 1164492   7008576  15% /
tmpfs             304144       0    304144   0% /dev/shm
/dev/xvdb1      20308020  203844  19055940   2% /data
[[email protected] ~]$ sudo /etc/init.d/glusterfsd status
glusterfsd (pid 3370) is running...
[[email protected] ~]$ sudo mount /var/www
$ df
Filesystem                                                    1K-blocks    Used Available Use% Mounted on
/dev/xvda1                                                      8256952 1164516   7008552  15% /
tmpfs                                                            304144       0    304144   0% /dev/shm
/dev/xvdb1                                                     20308020  203844  19055940   2% /data
ec2-176-34-56-112.ap-northeast-1.compute.amazonaws.com:/vol01  20307968  203904  19055872   2% /var/www
[[email protected] ~]$ sudo /etc/init.d/nginx start
Starting nginx:                                            [  OK  ]
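
The `gluster volume status` check from this session can be scripted so that a later reboot gets caught sooner. A rough sketch, assuming the output keeps the layout shown above, where every process row ends with port, Online (Y or N), and pid. The function name all_gluster_processes_online is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `gluster volume status` output on stdin and
# succeed only if at least one process row exists and none is offline.
# Assumes each process row ends with "<port> <Y|N> <pid>"; a wrapped
# hostname line has no such trailer and is ignored.
all_gluster_processes_online() {
    awk '
        NF >= 3 && $(NF-1) ~ /^[YN]$/ {
            rows++
            if ($(NF-1) == "N") offline++
        }
        END {
            if (rows > 0 && offline == 0) exit 0
            exit 1
        }
    '
}

# Usage:
#   sudo gluster volume status vol01 | all_gluster_processes_online || echo "vol01 is degraded"
```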


That's it for now.