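1. Introduction to Cluster Logging Methods
Category | Description |
kubectl | • Check cluster information and status with kubectl • Dump and inspect the full information and status of every object in the EKS cluster |
Container Insights | • Aggregates and displays the metrics collected from the EKS cluster • Presents the current status as graphs, a connection topology map, and lists |
Log groups | • Collect and search EKS cluster logs • View logs from the EKS control plane (master nodes), which users cannot otherwise access directly |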
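2. Cluster Issue Cases
Category | Description |
Cluster failure | • The EKS cluster itself fails to start -> can happen when secrets encryption uses a KMS key -> and that KMS key has a problem (deleted, expired, permissions changed, etc.) |
Performance issue | • The EKS control plane becomes slow -> when creating the cluster, at least two subnets in different AZs are required -> the service IPv4 range must not overlap the subnet IP ranges |
Update issue | • An update is slow or fails -> if something goes wrong, it can only be resolved through AWS Support |
AWS issue | • DNS queries become slow or fail because of an AWS DNS server problem -> a rare case that may occur once every few years |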
3. Configuring EKS for Container Insights Integration
(1) Install the CloudWatch Agent and Fluent Bit with a K8s manifest
- Chapter09 > Ch09_02-cluster-troubleshooting
$ kubectl apply -f cwagent-fluentbit.yaml
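Before applying the manifest, it may help to confirm that the cluster-name placeholder has actually been replaced with a real cluster name. The following is a minimal Python sketch, assuming placeholders are written in angle brackets as in this manifest; the function name `find_unreplaced_placeholders` is hypothetical and not part of any AWS tooling.

```python
import re
from typing import List

# Matches values such as  cluster.name: <...>  or  "cluster_name": "<...>".
# Assumption: an unreplaced placeholder is always wrapped in angle brackets.
PLACEHOLDER = re.compile(r'cluster[._]name"?\s*:\s*"?(<[^>\n]*>)')


def find_unreplaced_placeholders(manifest_text: str) -> List[str]:
    """Return every angle-bracket placeholder still used as a cluster name."""
    return PLACEHOLDER.findall(manifest_text)
```

Reading cwagent-fluentbit.yaml into a string and passing it to this function should return an empty list once both the `cluster_name` entry in cwagentconfig.json and the `cluster.name` entry in the fluent-bit-cluster-info ConfigMap have been filled in.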
The contents of the cwagent-fluentbit.yaml file are as follows.
# create amazon-cloudwatch namespace
apiVersion: v1
kind: Namespace
metadata:
  name: amazon-cloudwatch
  labels:
    name: amazon-cloudwatch
---
# create cwagent service account and role binding
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cloudwatch-agent-role
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "endpoints"]
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["list", "watch"]
  - apiGroups: [""]
    resources: ["nodes/proxy"]
    verbs: ["get"]
  - apiGroups: [""]
    resources: ["nodes/stats", "configmaps", "events"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames: ["cwagent-clusterleader"]
    verbs: ["get","update"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cloudwatch-agent-role-binding
subjects:
  - kind: ServiceAccount
    name: cloudwatch-agent
    namespace: amazon-cloudwatch
roleRef:
  kind: ClusterRole
  name: cloudwatch-agent-role
  apiGroup: rbac.authorization.k8s.io
---
# create configmap for cwagent config
apiVersion: v1
data:
  # Configuration is in JSON format. Whatever change you make,
  # please keep the JSON blob valid.
  cwagentconfig.json: |
    {
      "agent": {
        "region": "ap-northeast-2"
      },
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "<EKS cluster name>",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
kind: ConfigMap
metadata:
  name: cwagentconfig
  namespace: amazon-cloudwatch
---
# deploy cwagent as daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
spec:
  selector:
    matchLabels:
      name: cloudwatch-agent
  template:
    metadata:
      labels:
        name: cloudwatch-agent
    spec:
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247350.0b251780
          #ports:
          #  - containerPort: 8125
          #    hostPort: 8125
          #    protocol: UDP
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 200m
              memory: 200Mi
          # Please don't change below envs
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CI_VERSION
              value: "k8s/1.3.9"
          # Please don't change the mountPath
          volumeMounts:
            - name: cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: dockersock
              mountPath: /var/run/docker.sock
              readOnly: true
            - name: varlibdocker
              mountPath: /var/lib/docker
              readOnly: true
            - name: containerdsock
              mountPath: /run/containerd/containerd.sock
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: devdisk
              mountPath: /dev/disk
              readOnly: true
      volumes:
        - name: cwagentconfig
          configMap:
            name: cwagentconfig
        - name: rootfs
          hostPath:
            path: /
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: varlibdocker
          hostPath:
            path: /var/lib/docker
        - name: containerdsock
          hostPath:
            path: /run/containerd/containerd.sock
        - name: sys
          hostPath:
            path: /sys
        - name: devdisk
          hostPath:
            path: /dev/disk/
      terminationGracePeriodSeconds: 60
      serviceAccountName: cloudwatch-agent
---
# create configmap for cluster name and aws region for CloudWatch Logs
# replace the cluster.name placeholder (<EKS cluster name>) and set logs.region (here ap-northeast-2)
# http.server and http.port are set to "On" and "2020"
# read.head and read.tail are set to "Off" and "On"
apiVersion: v1
data:
  cluster.name: <EKS cluster name>
  logs.region: ap-northeast-2
  http.server: "On"
  http.port: "2020"
  read.head: "Off"
  read.tail: "On"
kind: ConfigMap
metadata:
  name: fluent-bit-cluster-info
  namespace: amazon-cloudwatch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-role
rules:
  - nonResourceURLs:
      - /metrics
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - namespaces
      - pods
      - pods/logs
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-role
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: amazon-cloudwatch
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush                     5
        Log_Level                 info
        Daemon                    off
        Parsers_File              parsers.conf
        HTTP_Server               ${HTTP_SERVER}
        HTTP_Listen               0.0.0.0
        HTTP_Port                 ${HTTP_PORT}
        storage.path              /var/fluent-bit/state/flb-storage/
        storage.sync              normal
        storage.checksum          off
        storage.backlog.mem_limit 5M

    @INCLUDE application-log.conf
    @INCLUDE dataplane-log.conf
    @INCLUDE host-log.conf

  application-log.conf: |
    [INPUT]
        Name                tail
        Tag                 application.*
        Exclude_Path        /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Path                /var/log/containers/*.log
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_container.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/fluent-bit*
        Parser              docker
        DB                  /var/fluent-bit/state/flb_log.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 application.*
        Path                /var/log/containers/cloudwatch-agent*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  cwagent_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_cwagent.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                kubernetes
        Match               application.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_Tag_Prefix     application.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Labels              Off
        Annotations         Off

    [OUTPUT]
        Name                cloudwatch_logs
        Match               application.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/application
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  dataplane-log.conf: |
    [INPUT]
        Name                systemd
        Tag                 dataplane.systemd.*
        Systemd_Filter      _SYSTEMD_UNIT=docker.service
        Systemd_Filter      _SYSTEMD_UNIT=kubelet.service
        DB                  /var/fluent-bit/state/systemd.db
        Path                /var/log/journal
        Read_From_Tail      ${READ_FROM_TAIL}

    [INPUT]
        Name                tail
        Tag                 dataplane.tail.*
        Path                /var/log/containers/aws-node*, /var/log/containers/kube-proxy*
        Docker_Mode         On
        Docker_Mode_Flush   5
        Docker_Mode_Parser  container_firstline
        Parser              docker
        DB                  /var/fluent-bit/state/flb_dataplane_tail.db
        Mem_Buf_Limit       50MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Rotate_Wait         30
        storage.type        filesystem
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                modify
        Match               dataplane.systemd.*
        Rename              _HOSTNAME                   hostname
        Rename              _SYSTEMD_UNIT               systemd_unit
        Rename              MESSAGE                     message
        Remove_regex        ^((?!hostname|systemd_unit|message).)*$

    [FILTER]
        Name                aws
        Match               dataplane.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               dataplane.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/dataplane
        log_stream_prefix   ${HOST_NAME}-
        auto_create_group   true
        extra_user_agent    container-insights

  host-log.conf: |
    [INPUT]
        Name                tail
        Tag                 host.dmesg
        Path                /var/log/dmesg
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_dmesg.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.messages
        Path                /var/log/messages
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_messages.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [INPUT]
        Name                tail
        Tag                 host.secure
        Path                /var/log/secure
        Parser              syslog
        DB                  /var/fluent-bit/state/flb_secure.db
        Mem_Buf_Limit       5MB
        Skip_Long_Lines     On
        Refresh_Interval    10
        Read_from_Head      ${READ_FROM_HEAD}

    [FILTER]
        Name                aws
        Match               host.*
        imds_version        v1

    [OUTPUT]
        Name                cloudwatch_logs
        Match               host.*
        region              ${AWS_REGION}
        log_group_name      /aws/containerinsights/${CLUSTER_NAME}/host
        log_stream_prefix   ${HOST_NAME}.
        auto_create_group   true
        extra_user_agent    container-insights

  parsers.conf: |
    [PARSER]
        Name                docker
        Format              json
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                syslog
        Format              regex
        Regex               ^(?<time>[^ ]* {1,2}[^ ]* [^ ]*) (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?(?:[^\:]*\:)? *(?<message>.*)$
        Time_Key            time
        Time_Format         %b %d %H:%M:%S

    [PARSER]
        Name                container_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\S(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ

    [PARSER]
        Name                cwagent_firstline
        Format              regex
        Regex               (?<log>(?<="log":")\d{4}[\/-]\d{1,2}[\/-]\d{1,2}[ T]\d{2}:\d{2}:\d{2}(?!\.).*?)(?<!\\)".*(?<stream>(?<="stream":").*?)".*(?<time>\d{4}-\d{1,2}-\d{1,2}T\d{2}:\d{2}:\d{2}\.\w*).*(?=})
        Time_Key            time
        Time_Format         %Y-%m-%dT%H:%M:%S.%LZ
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: amazon-cloudwatch
  labels:
    k8s-app: fluent-bit
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: fluent-bit
  template:
    metadata:
      labels:
        k8s-app: fluent-bit
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
        - name: fluent-bit
          image: amazon/aws-for-fluent-bit:2.10.0
          imagePullPolicy: Always
          env:
            - name: AWS_REGION
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: logs.region
            - name: CLUSTER_NAME
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: cluster.name
            - name: HTTP_SERVER
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: http.server
            - name: HTTP_PORT
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: http.port
            - name: READ_FROM_HEAD
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: read.head
            - name: READ_FROM_TAIL
              valueFrom:
                configMapKeyRef:
                  name: fluent-bit-cluster-info
                  key: read.tail
            - name: HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CI_VERSION
              value: "k8s/1.3.9"
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 500m
              memory: 100Mi
          volumeMounts:
            # Please don't change below read-only permissions
            - name: fluentbitstate
              mountPath: /var/fluent-bit/state
            - name: varlog
              mountPath: /var/log
              readOnly: true
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluent-bit-config
              mountPath: /fluent-bit/etc/
            - name: runlogjournal
              mountPath: /run/log/journal
              readOnly: true
            - name: dmesg
              mountPath: /var/log/dmesg
              readOnly: true
      terminationGracePeriodSeconds: 10
      volumes:
        - name: fluentbitstate
          hostPath:
            path: /var/fluent-bit/state
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluent-bit-config
          configMap:
            name: fluent-bit-config
        - name: runlogjournal
          hostPath:
            path: /run/log/journal
        - name: dmesg
          hostPath:
            path: /var/log/dmesg
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
        - operator: "Exists"
          effect: "NoExecute"
        - operator: "Exists"
          effect: "NoSchedule"
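The syslog [PARSER] in parsers.conf is the one applied to /var/log/dmesg, /var/log/messages, and /var/log/secure. To see which fields it extracts, the same pattern can be tried in Python. This is only a sketch: Fluent Bit writes named groups as `(?<name>...)`, whereas Python's `re` module requires `(?P<name>...)`, so the groups are respelled and a few redundant escapes are dropped. The function name `parse_syslog_line` and the sample log line are hypothetical.

```python
import re
from typing import Dict, Optional

# Python port of the `syslog` parser regex from parsers.conf.
# Named groups are respelled as (?P<name>...); the matching logic is unchanged.
SYSLOG = re.compile(
    r"^(?P<time>[^ ]* {1,2}[^ ]* [^ ]*) (?P<host>[^ ]*) "
    r"(?P<ident>[a-zA-Z0-9_/.\-]*)(?:\[(?P<pid>[0-9]+)\])?"
    r"(?:[^:]*:)? *(?P<message>.*)$"
)


def parse_syslog_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Split one syslog-style line into time, host, ident, pid, and message."""
    match = SYSLOG.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line in the format of /var/log/messages.
sample = "Feb 10 12:34:56 ip-10-0-1-23 kubelet[1234]: Started container"
fields = parse_syslog_line(sample)
```

For the sample line, `fields["time"]` holds `Feb 10 12:34:56`, which is the shape that the parser's `Time_Format %b %d %H:%M:%S` expects.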
(2) Verify the CloudWatch Agent and Fluent Bit installation
$ kubectl get all -n amazon-cloudwatch
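Both components run as DaemonSets, so a healthy install should show one ready pod per node for each. As a rough sketch, the check can be automated by parsing the JSON from `kubectl get daemonset -n amazon-cloudwatch -o json`. The function name `unready_daemonsets` is hypothetical; `desiredNumberScheduled` and `numberReady` are standard DaemonSet status fields.

```python
from typing import List


def unready_daemonsets(daemonset_list: dict) -> List[str]:
    """Return a summary line for each DaemonSet with fewer ready pods than desired."""
    unready = []
    for item in daemonset_list.get("items", []):
        status = item.get("status", {})
        desired = status.get("desiredNumberScheduled", 0)
        ready = status.get("numberReady", 0)
        if ready < desired:
            name = item["metadata"]["name"]
            unready.append(f"{name} ({ready}/{desired} ready)")
    return unready
```

The input is the dictionary obtained by loading the kubectl JSON output with `json.loads`. An empty result means every DaemonSet in the namespace has all of its pods ready.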
4. Cluster Logging Hands-on
(1) Check the Kubernetes cluster status with kubectl
$ kubectl cluster-info
(2) Check the current state of all resources and objects in the Kubernetes cluster
$ kubectl cluster-info dump
$ kubectl cluster-info dump
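The dump is long, so it is often easier to filter it programmatically. `kubectl cluster-info dump` accepts an `--output-directory` flag that writes the dump to files instead of stdout. The sketch below assumes that one of those files is a per-namespace `pods.json` containing a standard PodList (the exact file layout is an assumption), and lists pods that are not in a healthy phase. The function name `pods_not_running` is hypothetical.

```python
from typing import List, Tuple

# Pod phases that need no attention.
HEALTHY_PHASES = {"Running", "Succeeded"}


def pods_not_running(pod_list: dict) -> List[Tuple[str, str]]:
    """Return (pod name, phase) for every pod that is not Running or Succeeded."""
    result = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase", "Unknown")
        if phase not in HEALTHY_PHASES:
            result.append((pod["metadata"]["name"], phase))
    return result
```

Loading a `pods.json` file with `json.load` and passing the result to this function gives a short list of pods worth inspecting first.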
(3) Check the Kubernetes cluster metrics in Container Insights
- AWS CloudWatch > Insights > Container Insights
(4) Check the collected Kubernetes cluster logs in Log groups
- AWS CloudWatch > Logs > Log groups > /aws/eks/test-eks-cluster/cluster
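The log group names follow fixed patterns, so the groups to look for can be derived from the cluster name. A small sketch: the application, dataplane, and host groups come from the `log_group_name` settings in the Fluent Bit configuration above, the performance group is where the CloudWatch Agent writes Container Insights metrics, and the control plane group only exists if control plane logging is enabled on the cluster. The function name `expected_log_groups` is hypothetical.

```python
from typing import Dict


def expected_log_groups(cluster_name: str) -> Dict[str, str]:
    """Map each log source to the CloudWatch log group it should appear in."""
    prefix = f"/aws/containerinsights/{cluster_name}"
    return {
        "application": f"{prefix}/application",   # container logs (Fluent Bit)
        "dataplane": f"{prefix}/dataplane",       # kubelet, kube-proxy, aws-node
        "host": f"{prefix}/host",                 # dmesg, messages, secure
        "performance": f"{prefix}/performance",   # CloudWatch Agent metrics
        "control_plane": f"/aws/eks/{cluster_name}/cluster",
    }
```

For the cluster used in this walkthrough, `expected_log_groups("test-eks-cluster")["control_plane"]` yields the same path shown in step (4).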
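5. Hands-on: How to Check When a Cluster Issue Occurs
(1) Check whether EKS key management is handled by KMS
- AWS EKS > Clusters > (name of the created cluster) > Configuration > Details > Secrets encryption
(2) If KMS is in use, the key can be checked at the following path
- AWS Key Management Service (KMS) > Customer managed keys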
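The same check can be scripted against the JSON returned by `aws kms describe-key --key-id <key>`. This sketch only inspects the `KeyState` field under `KeyMetadata`; it does not detect permission problems, which live in the key policy and grants rather than in the key state. The function name `kms_key_problem` is hypothetical.

```python
from typing import Optional


def kms_key_problem(describe_key_output: dict) -> Optional[str]:
    """Return a description if the key is not usable, or None if it is Enabled."""
    meta = describe_key_output["KeyMetadata"]
    state = meta.get("KeyState", "Unknown")
    if state == "Enabled":
        return None
    return f"KMS key {meta.get('KeyId', '?')} is in state {state}"
```

A state such as `Disabled` or `PendingDeletion` for the key that encrypts the cluster's secrets is a likely explanation for the startup failure described in section 2.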
(3) Check how many subnets the EKS cluster has
- AWS EKS > Clusters > (name of the created cluster) > Configuration > Networking > Subnets
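What matters is not only the number of subnets but that they sit in at least two different AZs. As a sketch, this can be verified from the JSON returned by `aws ec2 describe-subnets --subnet-ids ...` for the cluster's subnets, using the `AvailabilityZone` field of each entry. The function name `spans_multiple_azs` is hypothetical.

```python
def spans_multiple_azs(describe_subnets_output: dict, minimum: int = 2) -> bool:
    """Return True if the subnets cover at least `minimum` distinct AZs."""
    zones = {
        subnet["AvailabilityZone"]
        for subnet in describe_subnets_output.get("Subnets", [])
    }
    return len(zones) >= minimum
```

Two subnets in the same AZ make this return False, which matches the requirement listed under the performance issue in section 2.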
(4) Check the service IPv4 range of the EKS cluster
- AWS EKS > Clusters > (name of the created cluster) > Configuration > Networking > Service IPv4 range
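Whether the service IPv4 range collides with a subnet can be checked with Python's standard `ipaddress` module once both values are known. A minimal sketch; the function name `overlapping_subnets` and the CIDR values in the usage note are hypothetical examples.

```python
import ipaddress
from typing import List


def overlapping_subnets(service_cidr: str, subnet_cidrs: List[str]) -> List[str]:
    """Return the subnet CIDRs that overlap the Kubernetes service IPv4 range."""
    service = ipaddress.ip_network(service_cidr)
    return [
        cidr for cidr in subnet_cidrs
        if ipaddress.ip_network(cidr).overlaps(service)
    ]
```

For example, with a service range of `10.100.0.0/16` and subnets `10.0.1.0/24` and `10.100.5.0/24`, only `10.100.5.0/24` is reported. An empty list means the ranges are safely disjoint.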
(5) When updating the EKS cluster
- AWS EKS > Clusters > (name of the created cluster) > Configuration > Cluster configuration
(6) How to request AWS Support when a problem occurs during an EKS update
- AWS Support > Open support cases > Create case > Technical support
(7) How to check for AWS issues
- AWS Health Dashboard > General service events > Open and recent issues