
Kubex Product Walkthrough Video
Understands the Full Stack
Optimizes safely from containers to pods to the nodes they run on
Delivers Realizable Gains
Recommends only what will improve performance and cost, and omits anything that won’t
Reliable Automation
Acts directly or connects to your preferred methods and systems
See the Benefits of Optimized Kubernetes Resources
Kubex is an AI-driven analytics engine that precisely determines optimal resource settings for Kubernetes.
[Video transcript]
Welcome to a walkthrough of the new Kubex product from Densify.
What you’re looking at is the new Kubex UI, and I’ll walk you through everything on the screen.
First of all, the main screen: what we’re seeing here is a histogram of all the containers in this environment, telling me how correct or incorrect they are from a resource management perspective.
And you’ll see that it’s divided into CPU requests, memory requests, CPU limits, and memory limits. The way you interpret this is that on these histograms, the number of containers in each section is shown by the size of the bar. So, for example, for CPU requests in this environment, there are about 1,200 containers. It’s not a really huge environment; it’s just a lab. But we’re saying about 217 of those containers are set correctly: the CPU requests are set appropriately for their utilization levels, based on policy.
At the far left, you see a gray bar. Those are containers that don’t have a CPU request value. And in this case, 243 containers are missing that setting, which can cause risk and overstacking of the nodes.
Beside it, you also see containers that are too small, in this case one fifth the size they should be based on their utilization, ranging all the way up to ones that are too big, maybe twice the size they should be. And ultimately, this big yellow bar is the ones that are 5x too big.
And that’s very common for us to see: there is just a lot of stranded CPU out there based on utilization. You do not need to give these containers this much CPU. So this big yellow bar for CPU requests is often a marker that there is a cost saving opportunity in the environment. And again, I’m looking at the entire environment here.
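As a rough illustration of the bucketing idea behind this histogram, here is a minimal Python sketch; the thresholds and names are my own assumptions, not Kubex’s actual policy:

```python
# Illustrative only: bucket containers by the ratio of configured CPU request
# to the recommended request, mimicking the histogram's sections.
def bucket(request_millicores, recommended_millicores):
    if request_millicores is None:
        return "missing request"      # the gray bar at the far left
    ratio = request_millicores / recommended_millicores
    if ratio < 1:
        return "too small"            # red: undersized for its utilization
    if ratio >= 5:
        return "5x+ too big"          # the big yellow bar
    if ratio > 1.25:
        return "too big"              # oversized, but by less than 5x
    return "set correctly"            # matches policy

for name, req, rec in [("api", 1000, 450), ("worker", None, 200), ("cache", 500, 480)]:
    print(f"{name}: {bucket(req, rec)}")
```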
If I click my way down this list on the left, I can look at individual environments. I can look at clusters. I can open these up and look at individual namespaces. Here’s a large namespace, for example. I can navigate around and look at the health and opportunity across my whole environment.
So again, I’m just looking at everything here. If you have a big company, you can see your entire container estate in one picture.
So for CPU, we typically see a big yellow bar, meaning there are cost savings opportunities. For memory, it’s a little more nuanced. There is a big yellow bar, but there’s also an even bigger gray bar and a lot of red. And we see this quite a bit, where memory is kind of all over the place: some containers are way too big, some way too small. And this can have a profound impact on your operation, because Kubernetes doesn’t know how to stack these things up. If you say you’re going to use a gig and you actually use five, you’d be down in this bucket here.
You might end up overstacking the node and having out-of-memory kills. If you say you need five but you only use one, then you’re stranding a lot of capacity. And we usually see all of the above happening in any given node group. In this case, we also have a large number of containers that don’t have memory requests at all.
And these are like dark matter: they run in the environment, chewing up memory, but Kubernetes isn’t aware of what they’re doing. And it can lead to nodes getting overstacked. That’s exactly what this little warning up here is. We’ve added algorithms to automatically detect this and say: in your environment, you have risk of out-of-memory kills, because there are nodes running out of memory. And I’ll come back to that in a minute.
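If you want to spot those same two signals yourself, here is a minimal sketch using the Kubernetes Python client; it assumes kubeconfig access, and it is not how Kubex itself collects data:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    # "Dark matter": containers with no memory request at all.
    for c in pod.spec.containers:
        requests = (c.resources.requests if c.resources else None) or {}
        if "memory" not in requests:
            print(f"{pod.metadata.namespace}/{pod.metadata.name}/{c.name}: no memory request")
    # Out-of-memory kills reported by the kubelet for the previous container run.
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated if status.last_state else None
        if terminated and terminated.reason == "OOMKilled":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}/{status.name}: OOM-killed")
```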
So CPU is a cost problem, and memory may be a cost problem too. There is overall a surplus of memory in this environment, so there probably is a cost saving opportunity. But you’ve got to fix this red and this gray first, because you might already be having operational issues.
Now, CPU limits we don’t tend to focus on as much. We find customers don’t necessarily set these; it’s optional. We do give recommendations, but it’s not a key area. But memory limits are, and there are two things going on here. Yellow isn’t the end of the world here; it just means you’re giving the container a lot of free rein to get pretty big if it wants to.
Gray is a problem. It means the container is unconstrained: if it has a memory leak, it can take over the entire node. We’ve seen that happen. So that’s a pretty important one to fix.
But the worst one here is this little bit of red at the bottom. These are containers that have limits, but the limits are too low. And if they hit their limit, they just get killed; specifically, the Linux kernel will kill a process inside the container, and it might restart, or it might not. That’s not a good thing to be happening, and you never want to see this down here. We’re also directly detecting that here: we’re saying, yep, we are detecting memory limit events causing restarts in the environment.
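For context, the requests and limits being discussed live in each container’s resources stanza. A hypothetical corrected stanza, expressed with the Kubernetes Python client (the values are invented for illustration):

```python
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves on the node
    limits={"cpu": "1", "memory": "512Mi"},       # above the memory limit, the kernel OOM-kills the process
)
```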
So before I drill down into that: this is an at-a-glance view, like an MRI for your whole environment. Again, we run this against tens or even hundreds of thousands of containers at once, and you can understand at a glance what on earth is happening. And it’s complicated, because you might look okay overall: on average, you have the right amount of memory, but it’s all in the wrong places, all misallocated, causing operational risk. So even in the most sophisticated environments, we find all of these problems occurring. So this is the histogram.
I also have a summary tab at the top here; you’ll notice these tabs across the top. If I go there, I can get a simpler view by ranking. I can rank environments by how big they are, and I can see which ones have the most containers at risk and which ones have the most waste. That’s really useful. If I click down into maybe the bigger environment here, I will then get a ranking of the namespaces by their risk or by their waste. So I can navigate this way as well: I can just use my top-N lists at the environment level in this tree to make my way through the data.
But I’m going to go back up to the top here, to the histogram. The third tab here is the AI analysis details; that’s where we get into the details, the analysis, and the recommendations. I can click there, but the way I like to get there is just by clicking on the histogram: I can click on any of these bars, any of these warning signs, or any of these links to get to the relevant information.
And what I’m going to start with is this memory limit one, because it’s not a cost saving thing; it’s just a nasty thing that you probably want to fix first. So if I click here, the navigation takes me down into this AI analysis details page, which gives me a tailored view of exactly what the problems are from a memory limit perspective. Let me describe what we’re looking at here. This is a table of the containers, with all the identifying information, how many of them are running, what their memory limits are, and what the recommendations are. And you can see a lot of these are too small; a negative surplus means that you don’t have a high enough limit. And this ‘yes’ on the far right means you are hitting your limit. So we actually have algorithms to detect that, and we can see it visually at the bottom here: we’re detecting it, and we see the number of restarts. If I go down to the bottom and open up one of these curves, I will also see it right here.
So what we’re looking at here, you’ll see across the top, is the machine learning model. We can also look at historical data, and I’ll cover that later, but I’m looking at the ML model, and this is a 24-hour pattern model of what this container typically does. In this case, it’s the memory utilization of the average container in the replica set, and I think this one is just a single replica, so it’s not replicating. And by hour, you can see this kind of candlestick chart where the minimum goes as low as just over 20 MB, and it goes as high as around 95 MB. I can click here and see what the actual numbers are in the tooltip, but that’s the range of operation for each hour of the day, derived by doing machine learning on the history. And typically we look at about 95 days of history to understand what this workload does.
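A toy approximation of that 24-hour pattern, assuming roughly 95 days of 5-minute samples in a CSV; the file and column names are mine, and the real ML model is richer than a simple group-by:

```python
import pandas as pd

samples = pd.read_csv("memory_samples.csv", parse_dates=["timestamp"])  # columns: timestamp, mem_mb
samples["hour"] = samples["timestamp"].dt.hour
pattern = samples.groupby("hour")["mem_mb"].agg(["min", "max", "mean"])
print(pattern)  # one row per hour of day: the candlestick's range of operation
```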
Now, the purple line is the limit. I’ll just go down here and show you: the request is at that level, and the limit is also at that level. We’re recommending a slightly higher request, but a much higher limit, because this thing is hitting its limit.
When you see the utilization line touching that purple line, it means you’re hitting your limit. It looks like it’s hitting it quite often, and it might be causing problems. By scrolling to the right, I can see all kinds of different workload stats in here, all the different memory and CPU stats. I can also scroll around, and you can see here: this thing’s restarting every hour of the day.
Usually containers are architected to survive a certain level of restarting, but maybe not this much, maybe not constant restarting. It might be a good idea to fix this memory limit. So that’s an example of drilling down: with one click from the histogram, I got down to my top-ranked problems from a memory limit perspective. I can see it all here, and I can drill down on it. Again, very rapid for risk mitigation. This isn’t going to save you any money, but it’s not going to cost you any money to fix it either; you just should fix the limits. So that’s the first thing I wanted to show from the histogram.
So again, we see this in every environment we analyze. I think I’ve only ever seen one environment where this wasn’t happening, and that’s because everything was so big, it was just way oversized. That’s the memory limits. Now, if I come up to the memory requests, again, there’s a little more nuance here. We might have some cost savings, but we want to fix these problems first.
So for this, I’m going to click on this. I could click in multiple places; actually, I’ll click on this warning as well. And that’s going to take me right down to the table view of all the top memory risks. In this case, for example, the first row: you didn’t give it a memory request, and it’s using about 8 GB of memory. I can see that down here. If I look at the memory usage, it’s about 7.58 GB, peaking up to around 9 GB. And we’re suggesting you should give it a request value at least big enough to cover sustained activity. So this is way too small, and it’s creating a shortfall, and that can cause a problem on the nodes. If I scroll to the right, you can see down the list there are all kinds of them that have shortfalls. And if I scroll all the way to the right, we do see that there is node saturation in here. So the node group these things are running in does have nodes that are hitting 100%. That’s not healthy; it’s probably causing out-of-memory kills. You see there are some restarts in here, and if I sort, overall we probably see a lot of restarts in this environment. These might not be the ones getting restarted; something else might be getting restarted, because the kernel has its own algorithms for choosing what to kill. But you really don’t want this happening.
So for the ones that don’t have values, we probably want to give them values, and for the others, we need to increase them to mitigate that risk. There’s probably a cost savings opportunity here too, but we want to fix that risk first. So that’s what we typically see from a memory request perspective. And again, we usually have interest from both FinOps and SREs: the SREs are really interested in these warnings on the right, and the FinOps folks are really interested in this big yellow bar, because that’s where the cost savings come from. So even with all these problems, there’s probably opportunity to save some money in this environment. And if I click here, I will get to the view that shows me the CPU request surplus for this environment.
So again, this is somewhat similar to the view I just showed you for memory, but it’s completely different data, sorted differently, filtered differently. And what we’re saying here, taking this top one as an example: this otel-collector, an OpenTelemetry collector, has been given 1,000 millicores, but it’s about half a CPU too big. And there are four of them running, so we’re stranding about 2 CPUs by doing so. And this in turn is wasting money.
Now, these are pretty small examples; this is just a lab. If you run this against a big environment, you’ll see this at scale. We’ve seen a single highly replicated container wasting hundreds of CPUs, and in one case, I think, 1,200 CPUs.
So this is the top 10 list of what you want to fix to save money. And if I go down to the bottom here again, I can see the same thing. If I look at the request, it’s up at 1,000, and we’re saying it should be down closer to 500. If you do that across all the different replicas, or copies, of this thing, you’re going to save some money.
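The arithmetic behind that line item is simple; a quick sketch using the numbers from this example:

```python
request_m = 1000      # current CPU request per replica, in millicores
recommended_m = 500   # recommended request per replica
replicas = 4

stranded_cpus = (request_m - recommended_m) * replicas / 1000
print(stranded_cpus)  # 2.0 CPUs stranded across the replica set
```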
Now, key to this is that it’s not quite that simple. That’s the waste; those are the oversized ones. But I need to look a little bit further to the right here to know if I can safely act on it. So let me go over to the right here.
So these are the node groups that these things are running on, and this is the analysis I showed a second ago of whether there was saturation in those node groups. First of all, if I see any CPU saturation (luckily I don’t here), if any of the nodes are running out of CPU, I don’t want to be downsizing the CPU requests; that could make it worse. Again, if there are servers running at the roof and I’ve got so many containers on them, and I start downsizing the containers so Kubernetes schedules even more onto each node, it’ll just make that problem worse. So we like to filter this and say: if this is over 0%, or over 5%, I don’t want to be taking that action. It’s a safety net to make sure I’m not going to make something worse by taking this action.
It’s the same for memory. I might want to filter out some of these as well, because if I start increasing the density, downsizing things and running more on each node, memory problems could get worse. So we might want to filter these down to only the node groups with zero or a very low percentage, to make sure the actions are safe. And again, this is a lab, so it’s a bit messy, but in real environments we typically see that some node groups have saturation and some don’t, and you just want to prioritize the ones where you won’t make things worse.
So these two columns are really useful for making sure the actions are safe. The third one is useful for making sure an action will actually make a difference. In this first row, for example, memory is the constraint on the nodes; if you look at the primary constraint of the nodes, memory is the actual problem. So I can downsize CPU all I want: it’s not going to make my density any higher, and it’s not going to save me any money, because all the nodes, or many of the nodes, are out of memory. So this primary constraint is very important. You see the second one is doable, because in that node group CPU is the constraint, so if I downsize my CPU requests, I will immediately see a benefit. If the saturation columns are zero and the constraint is CPU, for a CPU request action, or memory, for a memory request action, then you can take that action and actually save money right now. Again, this is a lab, so there’s not a lot of opportunity here, but in a real environment that’s key, and we call those realizable gains.
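Expressed as illustrative Python (my own data model, not the Kubex API), the gating for a realizable CPU downsize might look like this:

```python
def is_realizable_cpu_downsize(node_group):
    return (
        node_group["cpu_saturation_pct"] == 0          # safety: no nodes pinned on CPU
        and node_group["memory_saturation_pct"] == 0   # safety: won't worsen memory pressure
        and node_group["primary_constraint"] == "cpu"  # benefit: CPU actually limits density
    )

groups = [
    {"name": "ng-a", "cpu_saturation_pct": 0, "memory_saturation_pct": 12, "primary_constraint": "memory"},
    {"name": "ng-b", "cpu_saturation_pct": 0, "memory_saturation_pct": 0, "primary_constraint": "cpu"},
]
print([g["name"] for g in groups if is_realizable_cpu_downsize(g)])  # ['ng-b']
```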
And I can act on that in this view by clicking on these things. This table acts like Excel; I can do all kinds of powerful stuff in here. I can choose what I’m looking at, pick different columns, pick different formulas and sort orders, and do all of these things right in this table. I can even do pivoting and charting from this table. But we’ve also built in a whole lot of predefined views.
I’ve shown you three of them; I got to them by clicking down from the histogram, but they’re all also up here. You see, up here in the selector, I have different views, and I’ve been in the system views.
So when you click around from the histogram, you get to these views that are already built in: CPU risks, CPU waste, memory risks, and memory waste. Right now I’m looking at the CPU request surplus in this environment, which is just telling me all the ones that are oversized. If I go to this realizable CPU savings view, it actually filters that all down for me, and it’s showing me that one example that is safe to do and will make a difference. And again, in a real environment this would be full; there would be all kinds of savings here. This is just a small lab, but I wanted to make the point that you can use all these predefined views.
You can also make your own views, and I’ll cover that in a minute, but that lets you navigate around. So from here I can flip around and look at different views, or I can just go back to the histogram and click down and see the results. Again, clicking on these things will just go to one of those predefined views.
So that’s the main value prop. We find that there’s a lot of cost savings opportunity, and there are a lot of risks you want to fix. If you fix the risks, usually you unlock even more cost savings opportunity. The histogram is a great way to visualize and understand that for each of your different environments.
Now, the last thing I’ll show here is that typically when people go there, they go down to these drill-downs, and that’s like your top 10 list of what to do. Typically you might have to interact with different teams, maybe different application teams or engineers, to get those changes made. So with that in mind, I’ll show you that I can navigate around the tree, or I can even just do searching. And I’ll go find one of those containers I was looking at. Oh, I’m in the wrong view here; let me go to the cluster breakdown, the more standard view here. And I’m going to search for this PME server, which is one of the ones I showed earlier. If I go here, I actually get a homepage for that container. I can see the path of the container here. I can see that this one is actually hitting its limit; it’s probably having problems. I can see the restarts over here, and I can see it’s also on a node group that’s out of memory. Those are potential problems.
This one has all kinds of problems. I see all the recommendations, I see the financial impact, and all those curves are here as well. So this is very important, because it’s kind of one-stop shopping for a container. And if I grab this thing and copy the address, it’s shareable. Everything I’ve shown in this UI is deep-linkable. So I can send you a link to this container homepage, or I can send you a link to the table I was just showing you. You can share links from anything: anywhere I’ve navigated in the tree, and anything I’ve looked at. So, for example, in this case I might want to send this to the app team saying, hey, you might want to fix the limit on this.
It’s having a problem. The goal is eventually to automate these environments, but oftentimes there’s some trust that needs to be gained first. And when you send this to them, they can explore it for themselves: they can see the workload curves, see the results of the analytics, and understand them.
So it’s very important to be able to show the evidence behind the recommendations we’re giving, and that’s all shareable. Again: you can prioritize, find the top 10 things, communicate those recommendations, and actually make a difference in your environment.
So next I want to get into node analysis. What I’ve been showing so far is in the container section of the UI; if I go down to the node section of the UI, what I get is a view of all my node groups and my nodes. You see at the top here there’s a tab: I’m in the node groups, and then I’ll talk about the nodes.
There are a number of node groups in this environment. Again, this is just a lab environment, so you see there aren’t many nodes in these groups. But from this view, I can look at things in a couple of different ways. So I’m going to go up here, where we also have these views, and I’m going to look at it from a health perspective first.
So now what I’m looking at is my node groups, and I can see the number of container manifests in them. I’m looking for that saturation I was talking about a minute ago. Here you can see I don’t have any nodes that are running out of CPU, which is great. Usually there’s a lot of CPU capacity in most environments, but we do see cases where CPUs are strapped, and you want to make sure that if that’s the case, you’re aware of it.
And like I showed earlier, that can affect what actions you take. We also look at the balance ratio. This isn’t overly interesting data here, but if I have some machines running really high on CPU and some very low, that will show up here. Even if everything’s fine, you don’t want too big an imbalance, because that’s going to limit you at some point.
And it’s especially true of memory. So in this case, you see there’s a lot of memory saturation. This is pretty ugly; it’s a lab, again, and hopefully your environment doesn’t look this bad. But I want to see if anything’s running out of memory, because that can affect my optimization strategies. And I want to see if there’s any imbalance across the memory. In actual running environments with large production workloads, oftentimes we will see the nodes get out of balance because Kubernetes is having a hard time scheduling, because the requests are all wrong.
So if your requests are wrong in that histogram, if you have a spread of yellow and red and gray, oftentimes we’ll see pretty big imbalances in the nodes, and you want to fix that. Unfortunately, I can’t show any really great examples with the data I’m running against right now, but that’s where it would show up.
So that’s from a health perspective. If I go over and look at the views again, I can look at the waste view, which is a different kind of view. In this case, what I’m looking at now is the environment, and I can rank things, for example, by their utilization levels, but also by what the primary constraints are, the current node type, and what the optimized node type would be.
So, for example, a lot of times we see things running on, well, this is like a compute-optimized node, this row here. If you were to fix all the container requests and limits, you actually should be running on a memory-optimized node. And if I scroll to the right, there should be a savings number over here.
That’s telling me that if you were a fully optimized environment, you could save quite a bit of money. I know this is a small environment, but picture these numbers being a hundred or a thousand times bigger; that’s what we usually see. And so that’s going to let me know the size of the prize here.
As far as the optimized targets go, usually we find things tend towards memory-optimized, because memory is the big constraint in a lot of environments. But in the current configurations, we see things running on general purpose or CPU-optimized nodes. And they probably shouldn’t be; they only have to be because of the way they’re configured.
But once they’re optimized, they shouldn’t have to be. So there’s a big opportunity here to save across the nodes. That’s the waste view; I’ve shown the risk view, and you can make your own views in these tables as well, to see things however you want. That’s the node group level. I can also go down to the node level.
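As a loose illustration of why optimized environments drift toward memory-optimized nodes, here is a toy heuristic that picks a node family from the aggregate recommended memory-to-CPU ratio; the thresholds and family examples are my assumptions, not Densify’s analysis:

```python
def suggest_node_family(total_cpu_cores, total_memory_gib):
    # Rough GiB-per-vCPU ratios of the common cloud node families.
    ratio = total_memory_gib / total_cpu_cores
    if ratio >= 6:
        return "memory-optimized (e.g. AWS r-family, ~8 GiB per vCPU)"
    if ratio <= 3:
        return "compute-optimized (e.g. AWS c-family, ~2 GiB per vCPU)"
    return "general purpose (e.g. AWS m-family, ~4 GiB per vCPU)"

print(suggest_node_family(32, 256))  # memory-optimized
```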
And these are the actual nodes running in those node groups. I see the node groups here, so I can filter in on, say, just the agent one and agent two node groups. There aren’t many nodes here, but I can see these things, I can see their average utilization, and I can see how long they’ve been running.
These are very long-running nodes. If I go in here, I can also turn on, for example, not just the average but the peak memory utilization. So I can fully customize these views. And now I’m seeing these ones are actually sitting at 99.4%; they’re basically saturated.
And I can see that at the bottom here: these are the actual utilization curves of these nodes, and they’re kind of bumping the roof. This is a lab, so in our case this is intentional, but you probably don’t want it happening in your environment. So I can go down and get quite a bit of visibility into the nodes themselves and the node groups.
And that’s all very important for understanding whether I have any risks or any opportunities at this level. Equally importantly, this is all factored into the container analysis. So when I went through the container analysis, all of this data was feeding into what we optimize at the container level, to make sure it’s safe. We call this full stack optimization: we want to make sure that any action we recommend is fully informed by everything going on in the environment.
Next I want to move on to navigation, views, and filters. You might have gotten a bit of a peek at this earlier, when I went through this tree. I’m just going to go back here and go to the cluster breakdown, and I see a view, in this case, of my environment by cluster and namespace.
This is the pod, and this is the container, and you saw when I clicked around before that I can get right down to the container level, or stop at any other level. If I click on one, I get, in this case, a histogram of what that level looks like. So I can navigate around here, and I can also make different views.
For example (and there are a bunch in here, because people have been adding them in our lab environment), I can make public or private views. So if I click here: the cluster breakdown is by cluster, namespace, and what’s called a pod owner name, which is either the pod, or the deployment, or the replica set, whatever is controlling the pods.
And I can make a new one, or I can copy this one. When I copy it, I can give it a new name. Let’s call it, I don’t know, we’re doing a demo, so let’s call it demo cluster breakdown, or let’s call it demo view, just to be safe here. I can set it as public, or I can set it as private.
I’m just going to make mine private. I can even make it my default. So it’s pretty powerful: I can share it with everybody, or I can just make it mine. And I have cluster, namespace, pod owner name. Let’s go wild here. Maybe I want to see it in a cleaner way; maybe I want to add, let’s just say, container name, even though the containers already show up at the root of the tree.
And sorry, I’m having trouble adding it. Let me see if I can add a different one. Okay, I’m not able to add them right now; I must be in some kind of read-only mode here. But if I add a layer, say business unit, I can add that in here, and I can move these up and down in the tree. So what I’ll do is just do this a different way.
I’m actually going to move namespace to the top of the tree, so I’m going to see my whole view by namespace. Let me apply that and say okay. Now I should have a new view up here, which is my demo view. And if I go there now, the top of my tree is actually the namespaces.
So let’s go to, I don’t know, the cloudwatch namespace. That’s the cluster that namespace is in, and here are the pods. So you see, I’ve just flipped it on its head. Let’s go down and find, here’s kube-system. Kube-system exists in pretty much all of the clusters.
So I can just kind of twist this around. I can do it by business unit. I can even put container name at the top of the tree and say, just show me all my nginx containers at the very top. So I can flip this on its head and navigate any way I want. Sorry, I keep going into the filters; I can go back to cluster breakdown.
That’s usually the one I want to make my default, and in fact I’m going to set it as my default right now, because it isn’t, so the next demo I give will work the way I want it to. So cluster breakdown is my default, but I can make the tree behave any way I want, and I can filter it on anything I want.
So in the filters, I’ll pull this up. I’m in my Kubernetes environment filter, which is really telling me anything that’s Kubernetes; that’s my default filter. Let me copy this one, and let me call it the Kubernetes environment without kube-system. We find this is a common request, because you don’t actually set your own requests and limits in kube-system. So what I’m going to do is add a row to this filter and say namespace not equal to kube-system, with AND logic: both those things have to be true. So I say apply, or okay (I won’t set this one as my default filter), and say okay on that. Now I should have a new filter here, Kubernetes without kube-system. And if I open these trees up now, you’ll notice kube-system is gone, and in fact I’m now down to 1,000 containers instead of 1,200. So this is a pretty common one; oftentimes, like I said, a lot of the gray on the histograms comes from kube-system.
So you can do that when you just don’t want to see that stuff: I can filter out things that I’m not going to change, or that I’m not able or allowed to change, and get a more focused view. So really it’s a nice way to set up not only the structure of the left-hand side, but also what I want to see.
And in this case, in our lab, I’m still seeing a big gray bar, so that’s not great; if you have a lot of gray outside kube-system, you probably want to fix that. If I go and edit this one again, I’m just going to do one more thing: let me just copy this one and make a new one.
And I want to say: and container name equal to nginx, and I’m just going to call this the nginx filter. So I can even go down to this level and say, all right, let’s just do that. And then if I go to that nginx filter, I’m still viewing things by cluster and namespace, but I only get my nginx containers, and these are where they all sit. So I can choose any container and filter things this way. Now I’m down to 20 containers, and I can see they’re pretty horrible: most of them don’t have a request value, or they’re five times too big. So again, we probably need to clean up our lab. You see the dollar sign appear here now, because we just surfaced a cost saving opportunity there. So anyway, views and filters are very powerful. I can go and look at my environment any way I want, view it any way I want, and filter it any way I want, and I can set things as my default. Let me just set that as my default and navigate again.
So it’s a very important way to isolate things, so you’re looking at exactly what’s important to you, in the way you want to look at it.
Next, what I want to talk about is advanced metrics viewing. For that, I’m going to go back into the histogram. Let’s just drill down on this yellow and talk about what we’re seeing under here. We highlighted this one before; it’s a bit bigger than it needs to be.
And I have different utilization curves down here. Let’s pull up the utilization curve for the busiest replica. For replicated workloads, we track both the average of all the replicas and the busiest of all the replicas; that is, on any hourly sample, what’s the highest we ever saw?
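A small sketch of that average-versus-busiest-replica idea, assuming per-replica samples in a CSV; the column names are mine, and Kubex computes this inside its own pipeline:

```python
import pandas as pd

df = pd.read_csv("cpu_by_replica.csv", parse_dates=["timestamp"])  # timestamp, replica, cpu_cores
hourly = (
    df.set_index("timestamp")
      .groupby("replica")["cpu_cores"]
      .resample("1h")
      .mean()                                              # each replica's hourly mean
)
avg_replica = hourly.groupby(level="timestamp").mean()     # the "average replica" series
busiest_replica = hourly.groupby(level="timestamp").max()  # the "busiest replica" series
```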
So this is an interesting curve. We’re saying that this workload does sometimes reach up as high as one CPU of utilization in our ML model. Now, what I’m going to do is go over to these other models here. I can click historical hourly, and what that gives me is not the machine learning model.
This is the actual history of this thing. And you see now this bottom context slider appears, and we’re looking at days of data. I think by default we’re looking at a week here; yes, this is the last week of operation, by hour. And I can slide this around; I can slide that hour window back. Again, we typically operate off 95 days of data,
so we make sure we get a full business cycle. I can see everything that went on in this replicated set of servers. Very nicely, I can zoom in and say, hey, let’s get that little peak there that I see at the bottom. Oh yeah, that looks right: it spent about a day, or the better part of a day, up at high utilization from an hourly workload perspective.
So I can get to this anywhere I have this modal up; I can use this functionality anywhere. If I sent the link to this to an app team, they could go and do the same thing and explore the data. So it’s kind of like Grafana built into the product: I can get a pretty rich view right in the environment and share it with my colleagues.
We can also see the raw samples. This candlestick is saying, for each hour, what was the range of operation: we went as low as that, and as high as that, within the hour. The thick blue part is the sustained activity. I can also say, just show me my raw samples. If I do that, it goes and gets the actual raw data coming back from Prometheus.
We store it all historically ourselves, so you don’t need Prometheus to keep a lot of history for us to do this; we keep the history automatically. So in this case, I’m now looking at the last week of raw samples, and I can see exactly what this thing does in a higher-resolution form, in 5-minute samples.
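If you wanted to pull the same kind of raw 5-minute samples straight from Prometheus yourself, the standard query_range API would do it. A sketch, where the endpoint, metric, and label values are assumptions:

```python
import time
import requests

resp = requests.get(
    "http://prometheus:9090/api/v1/query_range",
    params={
        "query": 'container_memory_working_set_bytes{container="pme-server"}',
        "start": time.time() - 7 * 24 * 3600,  # the last week
        "end": time.time(),
        "step": "5m",                          # 5-minute resolution
    },
)
for ts, value in resp.json()["data"]["result"][0]["values"]:
    print(ts, float(value) / 2**20, "MiB")
```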
So it’s very, very useful for getting to the bottom of things, when I want to go back and see exactly what happened and understand the evidence for why certain recommendations are being generated. And that can all be done right from this modal. I can also look at daily data if I want to.
That’s really more of a summarized view: just the daily-level range of operation. Let’s give this a second to come up, and I can see that, again, back through the last 95 days as well. So: various really useful views for understanding what on earth is going on. Of course, you can just rely on our ML models.
They’re a summary of what’s important out of all that data, but you can see the raw data if you want to. Now, that’s available from anywhere you see one of these curves; there’ll be this little break-open icon that brings up the modal. The other way to do this: I’ll go back over and do another search, and I’ll search on one of my favorite workloads, PME.
This is the one that’s having all kinds of problems. Again, I can pull that up from here and see what this thing is doing. But you’ll notice up here (I didn’t show this before) that I’m in the overview, and I can also go to what’s called the metrics viewer. This is the same thing, but a more powerful version of it, because I can add multiple things at once.
So, for example, what I can do is say: you know what, I want to see the resident set size. Let’s look at the working set utilization; that’s not the total memory, but the working set of the busiest container of this group of containers. There’s only one, and that’s this one down here. Let me turn off this CPU curve.
That removes the noise. Now I’m looking at the working set for the busiest container, and I can also go and say I want to look at the restarts, and I can see them. It looks like this one’s restarting periodically, not all the time. And I can flip this to the raw data and start to draw correlations.
So I start to zoom in now on times here, at the bottom, there. Now the data has come in. Let me not look at the whole last week, but zoom in on a small range here. You can see that pattern here, where you have the memory, in megabytes, sawtoothing and the container restarting.
And so we can start to draw those correlations. If you want to see whether something restarted because of a memory situation, you can see that kind of thing right from here. So this component is useful for looking at multiple curves at once and drawing correlations between the data. It’s also shareable:
I can grab the deep link for this and send it to someone, and they can view it. So again, really useful. There are a couple of different ways to look at the deeper workload data: in context, or in this metrics viewer for a given container, viewing the details side by side.
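As an aside, the sawtooth-then-restart correlation described above can also be checked offline; a rough pandas sketch, assuming the two series have been exported to CSVs (the layouts are my assumption):

```python
import pandas as pd

mem = pd.read_csv("working_set_mb.csv", parse_dates=["timestamp"], index_col="timestamp")
restarts = pd.read_csv("restart_count.csv", parse_dates=["timestamp"], index_col="timestamp")

joined = mem.join(restarts, how="outer").sort_index().ffill()
# A restart shows up as an increment in the cumulative counter, typically
# right after the memory series peaks and drops back down.
restart_events = joined[joined["restart_count"].diff() > 0]
print(restart_events["mem_mb"].describe())  # memory level at restart times
```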
The last thing I want to show: I’ve talked about how to reconfigure the views and filters in the tree, and I’ve talked about how to drill down on things. The other thing I can do is go back down; I’ll just click on AI analysis details. And now I have one of these views I was showing earlier, the surplus CPU requests.
So, all the things that are too big from a CPU perspective. And I showed how there are different system views that are built in; I can navigate to any one of these things very quickly and get to them from the histogram. And I can share links to this, so if I share the link to this page, I can let somebody else see this exact same view.
But one of the things I can do here is also make my own views. I can make one from scratch, or let me just duplicate this one, and I can call it copy one. Let’s just give it that name; that’s fine, why not. And now what I can start to do is say: you know what, I don’t want all these columns.
Let me go and get rid of, I don’t know, let me get rid of the actual request and just show the surplus. And then I’m going to get rid of a whole bunch of other columns here; I don’t want to see the node data. I can completely customize this any way I want. I can add the uptime, I can add the total hours, whatever I think is relevant, and start to make my own view.
So I can change the column headings. I can also say, you know what, only show me things that are more than, I don’t know, one CPU too big. And then only show me the ones that are the otel containers. I can customize this any way I want; I can look for only certain namespaces, only certain values.
I can even start to do grouping and filtering; there’s all kinds of powerful stuff you can do in here. So I can totally customize this, and just save that view. And now that’s my personal view of this, and it shows up down in my private views. So now I’ve got my own Andrew’s view of CPU surplus.
It’s just what I want to see, organized the way I want to see it. I can also add it to my favorites; if I do that, it moves up to the top of the list. And I can say, you know what, now this one’s right at the top of my list, because it’s one I want to see all the time.
And I can also make it public. That’s the other thing: down here, if I say make public, and I’ll just do that and mess up anybody else who’s demoing, they’re going to start seeing more in their public views. Now you can see I’ve made it a public view that other people can use as well.
And I can see that some of my colleagues have been doing the same thing. So this is a very important thing: if I want to make, you know, a FinOps-oriented report that has things organized the way I want them, scoped down to what I want to see, then when I click around, it will show me that same view wherever I click.
Now, of course, as I click around, it’s a very small environment, so these don’t all have data; this one has some. So this view is useful wherever I navigate: I will see this kind of view of things. And it’s very useful to make FinOps views, or risk-oriented views, if I want to really focus in on the nginxes that are restarting, or that kind of thing.
Top restarts, for example: I can make any kind of view I want and then share it with others. So it’s important to work with these views and filters: if I customize the tree to show what I want, navigate the way I want, and show me what I own or what I’m responsible for, I can make the view show me exactly what I want to see.
I still get all that great drill-down. I can share it with people, and I can download a CSV of it, so I can still pull it out and work in Excel if I want to, all from this table component. So it’s a very powerful table component that you can completely customize however you like.
And that’s it. Those are the areas I wanted to cover in this overview of the new Kubex product. Hopefully it’s helpful in understanding how to use it, and in trying it out in your own environment.
Thank you very much.