How do I fix "back-off restarting failed container" in Kubernetes?


CrashLoopBackOff might sound like the name of a grunge band from the 1990s, or a subtle reference to a Ringo Starr song from the 1970s. In reality it has nothing to do with the music business: CrashLoopBackOff is a Kubernetes condition that directly affects the performance and reliability of your workloads. Applications running in Kubernetes Pods may not function properly until you resolve the CrashLoopBackOff condition, which can occur for a number of reasons.


To learn more about CrashLoopBackOff errors, their causes, and how to solve them so that your Pods may be up and running again as soon as possible, continue reading.


What is Kubernetes CrashLoopBackOff?

A CrashLoopBackOff occurs when a container inside a Kubernetes Pod crashes and is restarted over and over again. When this happens, Kubernetes adds a growing backoff delay between restarts, capped at five minutes, so the cluster isn't hammered by an endlessly failing container and administrators have time to fix whatever is causing the crashes. It also typically generates an error message mentioning "back off restarting failed container."


So although CrashLoopBackOff might seem like a strange name, it describes the situation quite literally: your containers repeatedly crash (the crash loop), with backoff intervals in between (the back off).
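You can see this state recorded directly on the Pod object: the container status carries both the reason and the back-off message. Here is a minimal check, assuming a hypothetical Pod named my-pod in the current namespace (if the container happens to be mid-restart, the waiting state may be momentarily empty):


kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}{"\n"}{.status.containerStatuses[0].state.waiting.message}{"\n"}'   # my-pod is a placeholder name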


What causes a CrashLoopBackOff?

All CrashLoopBackOff states lead to the same symptom, a never-ending cycle of crashes, but they can have many different causes; a quick check that helps narrow them down follows this list.


ImagePullBackOff: If the container image cannot be pulled, your containers never start correctly in the first place.


OutOfMemory (OOM): The containers may be using more memory than their limit allows, in which case they are killed (OOMKilled) and restarted.


Errors in configuration: Incorrect environment variables or command parameters may be the cause of a crash loop.


Application bugs: The containers may be crashing soon after they start due to errors in the application code.


Persistent storage configuration problems: Containers may not start correctly if there is a problem accessing persistent storage volumes, such as a wrong path to the volume.


Resources that are locked: Occasionally, a database or file is locked because another Pod is already using it. A new Pod that tries to use the same resource may crash repeatedly as a result.


Read-only resources: Similarly, resources mounted as read-only can cause errors when Pods try to write to them.


Incorrect permission settings: Incorrect permissions may also prevent resources from being used, which could cause a CrashLoopBackOff.


Network connectivity issues: Crashes may be caused by network problems, which may arise from either a network failure or a networking setup issue.
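A quick way to start narrowing these down is to ask Kubernetes why the container last died. This sketch assumes a hypothetical Pod named my-pod; demo-ng is the namespace used later in this guide:


kubectl get pod my-pod -n demo-ng -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'   # my-pod is a placeholder name


A reason of OOMKilled points straight at the memory limit, while a generic Error usually means the application exited on its own, which sends you to the logs and configuration checks described below.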


What does a CrashLoopBackOff error look like in Kubernetes?


When a CrashLoopBackOff happens, Kubernetes does not flash a red light on your console or send you an alert; nothing notifies you of the problem directly.


However, you can determine that it's occurring by using a command similar to this one to check the status of your Pods:



kubectl get pods -n demo-ng



(demo-ng is the namespace we’ll be targeting in this guide. As the name suggests, it’s a demo setup).
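If you don't yet know which namespace the failing workload lives in, a cluster-wide scan works just as well; the grep filter here is a shell convenience, not a kubectl feature:


kubectl get pods --all-namespaces | grep CrashLoopBackOff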


If you see results like the following, you'll know you have a CrashLoopBackOff issue:


NAME                                   READY   STATUS             RESTARTS         AGE
adservice-79f74f8b7-92lqr              1/1     Running            14 (5m25s ago)   42m
cartservice-74d857d84-ddhsv            1/1     Running            15 (95s ago)     42m
checkoutservice-7db49c4d49-7cv5d       0/1     CrashLoopBackOff   16 (106s ago)    42m
currency-service-7b4d8694d4-tttrz      1/1     Running            0                42m
currencyservice-75df859666-5858z       0/1     CrashLoopBackOff   16 (100s ago)    42m
email-service-7995bddbb9-774fv         1/1     Running            0                42m
emailservice-865447b4d8-xs48l          1/1     Running            3 (31m ago)      42m
eventfetcherservice-575cb8f9fc-g5spx   1/1     Running            0                42m
favorites-service-7b9dcfb85d-z9gzk     1/1     Running            0                42m



Pods that are not Ready or that have restarted many times can also point to a CrashLoopBackOff problem, even when the STATUS column doesn't say CrashLoopBackOff outright; the RESTARTS column of the kubectl output is the quickest place to check.
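Since restart counts are often the earliest warning sign, it can also help to sort the listing by them; here is a sketch using the same namespace:


kubectl get pods -n demo-ng --sort-by='.status.containerStatuses[0].restartCount'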


How to troubleshoot and fix CrashLoopBackOff errors

Of course, identifying your CrashLoopBackOff problem is just the first step in fixing it. To diagnose and resolve CrashLoopBackOff issues, you'll need to dig deeper into the problem.


Check the Pod description with kubectl describe pod


The first step is to gather as much information as you can about the Pod with the kubectl describe pod command. For instance, to troubleshoot the CrashLoopBackOff failures in the checkoutservice-7db49c4d49-7cv5d Pod we observed above, run the following command:



kubectl describe pod -n demo-ng checkoutservice-7db49c4d49-7cv5d



You'd see output like the following:

Name:           checkoutservice-7db49c4d49-7cv5d
Namespace:      demo-ng
Priority:       0
Node:           ip-172-31-21-64.eu-west-3.compute.internal/172.31.21.64
Start Time:     Mon, 06 Feb 2023 13:24:51 +0200
Labels:         app=checkoutservice
                app.kubernetes.io/version=9ed644f
                pod-template-hash=7db49c4d49
                skaffold.dev/run-id=4cf3c87d-76f0-498a-8616-80276ea71596
Annotations:    app.kubernetes.io/repo: https://github.com/groundcover-com/demo-app
                kubernetes.io/psp: eks.privileged
Status:         Running
IP:             172.31.28.25
IPs:
  IP:           172.31.28.25
Controlled By:  ReplicaSet/checkoutservice-7db49c4d49
Containers:
  checkoutservice:
    Container ID:   docker://1f34cd7a1a44473c0521d53fc082eb7171edf666925048e0051ac5654d7a2fd1
    Image:          125608480246.dkr.ecr.eu-west-3.amazonaws.com/checkoutservice:9ed644f@sha256:a9d3f3bf80b3c31291595733025b8a6e0a87796b410d7baacc5bd9eee95dd180
    Image ID:       docker-pullable://125608480246.dkr.ecr.eu-west-3.amazonaws.com/checkoutservice@sha256:a9d3f3bf80b3c31291595733025b8a6e0a87796b410d7baacc5bd9eee95dd180
    Port:           5050/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 06 Feb 2023 14:05:35 +0200
      Finished:     Mon, 06 Feb 2023 14:06:03 +0200
    Ready:          False
    Restart Count:  16
    Limits:
      cpu:     200m
      memory:  128Mi
    Requests:
      cpu:     100m
      memory:  64Mi
    Liveness:   exec [/bin/grpc_health_probe -addr=:5050] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/bin/grpc_health_probe -addr=:5050] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      PORT:                          5050
      PRODUCT_CATALOG_SERVICE_ADDR:  productcatalogservice:3550
      SHIPPING_SERVICE_ADDR:         shippingservice:50051
      PAYMENT_SERVIC




In reviewing the output, pay particular attention to the following (a shortcut for pulling some of these fields directly is sketched after the list):

  • The pod definition.

  • The container.

  • The image pulled for the container.

  • Resources allocated for the container.

  • Wrong or missing arguments.
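If the full describe output is too noisy, the same details can be pulled straight from the Pod spec. Here is a minimal sketch against the Pod from this guide, printing each container's name, image, and resource settings:


kubectl get pod -n demo-ng checkoutservice-7db49c4d49-7cv5d -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\t"}{.resources}{"\n"}{end}'


Command arguments and environment variables live under .spec.containers[*].args and .spec.containers[*].env and can be printed the same way.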


Check the Pod's events

It's also possible to get this information using the kubectl get events command:



kubectl get events -n demo-ng --field-selector involvedObject.name=checkoutservice-7db49c4d49-7cv5d


This returns the Kubernetes events recorded for the Pod, which in a CrashLoopBackOff typically include failed liveness or readiness probes and repeated "Back-off restarting failed container" warnings.
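If the Pod has produced a lot of events, you can combine field selectors with a comma to keep only the warnings:


kubectl get events -n demo-ng --field-selector involvedObject.name=checkoutservice-7db49c4d49-7cv5d,type=Warning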


Dig through container and Pod logs with kubectl logs

At this point we know the Pod is failing its probes and that Kubernetes keeps reporting "back off restarting failed container," but we still don't know why the container is exiting in the first place.
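Pulling the container's logs is the next step; the --previous flag returns the output of the last crashed instance rather than the one currently starting up:


kubectl logs -n demo-ng checkoutservice-7db49c4d49-7cv5d --previous


In our example, the logs include entries like these: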



{"message":"starting to load data…","severity":"info","timestamp":"2023-02-06T12:11:49.985473135Z"}


{"message":"[PlaceOrder] user_id=\"d7f14948-0863-4ae9-a26e-92eeb1fbb7ae\" user_currency=\"USD\"","severity":"info","timestamp":"2023-02-06T12:13:36.450163905Z"}


{"message":"executing data currency conversion script"","severity":"info","timestamp":"2023-02-06T12:11:49.986148642Z"}


{"message":"no such file or directory: /Users/my_user/bin/currency_script.sh","severity":"error","timestamp":"2023-02-06T12:13:07.597217324Z"}


Alas, the logs won't always be this helpful. In our case, though, the error entry shows the application trying to execute a script that doesn't exist, and that is the root cause of our CrashLoopBackOff. In this instance, that should be enough to resolve the problem and fix the CrashLoopBackOff.


Looking at the ReplicaSet

In other situations, container and Pod logs won't help and we'll have to keep digging. Examining the workload that controls our Pod is the next place to look for clues.


The CrashLoopBackOff may stem from a misconfiguration in the ReplicaSet (or the Deployment behind it) that manages the Pod in question.
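The describe output above showed the Pod is controlled by ReplicaSet/checkoutservice-7db49c4d49, so that is the object to inspect next; the owning Deployment can be described the same way:


kubectl describe replicaset -n demo-ng checkoutservice-7db49c4d49


Compare the Pod template in this output against what you expect: the image tag, resource limits, probes, and arguments all come from here.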


Why Choose Supportfly

You would follow the troubleshooting process we just went through if you enjoy anguish and suffering.


On the other hand, hiring Supportfly to debug CrashLoopBackOff situations would be more appropriate if you're the type of person who loves sunshine and rainbows. All of your clusters' apps and Kubernetes infrastructure components are continuously monitored by Supportfly. Events that could point to a CrashLoopBackOff error can be tracked from a single location. In addition, you have access to logs from Pod containers, can check infrastructure monitoring data that can point to memory or CPU problems, and can compare the number of running Pods to the desired state. This information makes it easier to identify the underlying cause of CrashLoopBackOff issues.


Conclusion

In Kubernetes, troubleshooting a failing or restarting container can be difficult, but a methodical approach will usually uncover the underlying problem. Start with the Pod status, the logs, and the resource limits, then check the readiness and liveness probes, and look for recent changes to the container image or for network problems. With patience and persistence, you can get your container running properly again.

