A few weeks ago I wrote an article praising the simplicity of installing a Rancher management cluster. I tried it again last week and ran into some issues. Some were silly, some a bit convoluted. This guide covers the mistakes I made and the issues I ran into.
To begin with, here’s my rancher-cluster.yml:
nodes:
  - address: 192.168.1.11
    user: rancher
    role: [controlplane, worker, etcd]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 192.168.1.12
    user: rancher
    role: [controlplane, worker, etcd]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 192.168.1.13
    user: rancher
    ssh_key_path: ~/.ssh/id_rsa
    role: [controlplane, worker, etcd]
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
ingress:
  provider: nginx
  options:
    use-forwarded-headers: "true"
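Before handing this file to rke, a quick pre-flight can catch path typos. This is a rough sketch, not a real YAML parser: check_ssh_keys is a hypothetical helper that greps the ssh_key_path entries out of the config and checks that the key files actually exist locally.

```shell
# Sketch: pull ssh_key_path values out of the config and verify the
# key files exist on this machine before rke tries to use them.
check_ssh_keys() {
  cfg="$1"
  grep -E 'ssh_key_path:' "$cfg" 2>/dev/null | awk '{print $2}' | sort -u |
  while read -r key; do
    # expand a leading ~ manually, since it was read from a file and
    # the shell won't expand it for us
    case "$key" in "~"*) key="$HOME${key#\~}" ;; esac
    if [ -f "$key" ]; then
      echo "$key: found"
    else
      echo "$key: MISSING"
    fi
  done
}

check_ssh_keys rancher-cluster.yml
```

A `MISSING` line here usually means a typo in the path or a key that only exists on another machine.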
Not being able to connect via SSH
Rancher says it’s unable to connect to the other nodes
Failed to set up SSH tunneling for host [192.168.70.233]: Can't retrieve Docker Info
- Make sure that you can connect to all of the intended K8s nodes using the defined SSH key and the user you want to run Rancher as. In the example above I would test connecting to each of the nodes as the user rancher using ~/.ssh/id_rsa. Note that some guides create DSA keys, in which case you'll have to change the name to id_dsa in the config. This might sound obvious, but the difference is subtle.
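That manual test can be sketched as a loop. The addresses match the example config above, and check_node is a hypothetical helper; BatchMode makes ssh fail fast instead of prompting for a password, which mirrors what RKE needs (non-interactive key auth), and running docker info on the far side checks remote Docker access in the same go.

```shell
# Sketch: non-interactive SSH login as the rancher user to each node,
# running `docker info` remotely, roughly what RKE itself needs.
check_node() {
  # BatchMode=yes means ssh fails instead of asking for a password,
  # so a FAILED here points at broken key auth or remote docker access.
  if ssh -i ~/.ssh/id_rsa -o BatchMode=yes -o ConnectTimeout=5 \
      "rancher@$1" docker info >/dev/null 2>&1; then
    echo "$1: OK"
  else
    echo "$1: FAILED"
  fi
}

for node in 192.168.1.11 192.168.1.12 192.168.1.13; do
  check_node "$node"
done
```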
- Rancher connects to Docker on each node via the local socket, tunneled over SSH. For this to work you need to enable TCP forwarding in the node's SSH daemon. This is likely the reason the output from rke up says the tunnel has been created but that it can't reach Docker. Fix this by editing /etc/ssh/sshd_config and making sure that the following line is present and not commented out:
AllowTcpForwarding yes
Restart the SSH daemon afterwards for the change to take effect.
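If you have many nodes, the sshd_config edit for AllowTcpForwarding can be scripted. A minimal sketch assuming a Linux node with GNU sed (enable_tcp_forwarding is a hypothetical helper; run it as root against /etc/ssh/sshd_config and then restart sshd, e.g. with systemctl restart sshd):

```shell
# Sketch: uncomment/normalize the AllowTcpForwarding directive in an
# sshd_config-style file, or append it if it is missing entirely.
enable_tcp_forwarding() {
  cfg="$1"
  if grep -qE '^#?[[:space:]]*AllowTcpForwarding' "$cfg"; then
    # rewrite an existing (possibly commented-out) line in place
    sed -i -E 's/^#?[[:space:]]*AllowTcpForwarding.*/AllowTcpForwarding yes/' "$cfg"
  else
    echo 'AllowTcpForwarding yes' >> "$cfg"
  fi
}
```

Remember that sshd only picks up the change after a restart.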
Old config and network firewall complaints
Rancher says that the nodes probably have their firewalls enabled or that there are network issues.
[network] Host [192.168.x.y] is not able to connect to the following ports: [192.168.x.y:2379]. Please check network policies and firewall rules
- This can happen when an old configuration is still hanging around. Try cleaning up the old state by running:
rke remove --config ./rancher-cluster.yml
- Still failing? Try this command to skip the network checks:
rke up --config ./rancher-cluster.yml --disable-port-check
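Before (or instead of) disabling the check entirely, you can probe the flagged ports by hand. A sketch using bash's /dev/tcp pseudo-device so it works without netcat installed (check_port is a hypothetical helper; 2379 and 2380 are etcd's client and peer ports, and the address is from the example config):

```shell
# Sketch: try a raw TCP connect to a host:port with a short timeout.
check_port() {
  # bash's /dev/tcp opens a TCP connection; timeout stops it from
  # hanging forever on packets silently dropped by a firewall
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 BLOCKED"
  fi
}

check_port 192.168.1.12 2379   # etcd client traffic
check_port 192.168.1.12 2380   # etcd peer traffic
```

Run this from each node towards the others; a BLOCKED port that should be open points at a firewall or routing problem rather than an RKE bug.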
Rancher says it is denied mounting different mount source paths
error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: permission denied
This is likely an issue with conflicting Docker versions being installed. I followed the official installation instructions, but for some reason the snap version of Docker was still installed on all my nodes. Removing the snap version of Docker did the trick:
sudo snap remove docker --purge
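After purging the snap, it's worth confirming that the docker on your PATH is the one from the official packages and not a snap leftover. A small sketch (check_docker_path is a hypothetical helper; the /usr/bin vs /snap path distinction is the usual layout, not guaranteed):

```shell
# Sketch: report which docker binary the shell will actually run.
check_docker_path() {
  p=$(command -v docker 2>/dev/null) || { echo "docker not found"; return 0; }
  case "$p" in
    /snap/*) echo "WARNING: snap docker still on PATH at $p" ;;
    *)       echo "docker at $p" ;;
  esac
}

check_docker_path
```

On a correctly cleaned-up node you would expect something like /usr/bin/docker here.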
etcd health check fails
Rancher says that the etcd cluster health check failed and the cluster initialization fails.
rancher Error response from daemon: error while creating mount source path '/var/lib/etcd': mkdir /var/lib/etcd: read-only file system
Make sure that all the nodes are resolvable by hostname via DNS. For example, if node 1 has the hostname ranchermgmt-01.domain.com, the system DNS on each node should be able to resolve ranchermgmt-01.domain.com to the server's IP address.
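This can be checked from each node with getent, which queries the system resolver the same way other programs do. A small sketch (check_resolves is a hypothetical helper; only ranchermgmt-01.domain.com appears above, the other two hostnames are assumed siblings):

```shell
# Sketch: resolve each node hostname through the system resolver.
check_resolves() {
  addr=$(getent hosts "$1" | awk '{print $1; exit}')
  if [ -n "$addr" ]; then
    echo "$1 -> $addr"
  else
    echo "$1 does NOT resolve"
  fi
}

for h in ranchermgmt-01.domain.com ranchermgmt-02.domain.com ranchermgmt-03.domain.com; do
  check_resolves "$h"
done
```

Run the loop on every node, not just one: each node has its own resolver config, and a single node with a stale /etc/resolv.conf is enough to break the etcd cluster.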
These are all the issues I ran into. Do you have more, or do you have other possible solutions you wish to share? Let me know and I’ll happily update the guide!