Machine Learning and Ignition
Ignition Community Live
48 min video / 1 minute read
IIoT can give us a huge stream of data. You are probably using Ignition to store and visualize that data, but can you use Ignition to make predictions that will make your operation smoother, make operators happier, and increase the bottom line? In this session we’ll see some ways to combine Ignition and the power of machine learning to find the solutions hiding in your data.
Kathy: Good morning, everyone. Thank you for showing up for Ignition Community Live today. This is our weekly series of live chats, and today's topic is Machine Learning and Ignition. I'm Kathy Applebaum, your host today. I'm one of the senior software engineers here at Inductive Automation and I have a master's degree in Computer Science, where I worked on machine learning, data mining, and data warehousing. This is definitely a chat today, so I'll be taking your questions both during and at the end of today's webinar. So, IIoT can give us a huge stream of data. You're probably using Ignition to store and visualize that data, but can you use Ignition to make predictions that will make your operation smoother, make operators happier and increase the bottom line?
In this session, we'll see some ways to combine Ignition and the power of machine learning to find the solutions that are hiding in your data. The first step is to figure out what machine learning even is. There's a wide overlap between data analytics, machine learning and artificial intelligence and experts disagree on the exact dividing lines between them, but I like this definition a lot. Our machine learning should let us make some prediction to be useful and there should be some automation to that prediction. Otherwise, it's just compiling data to let an expert human look at later, which can be useful at times, but it's not what we're after here.
I also like that we have the words past data in this definition, because machine learning should be rooted, as much as possible, in real data. I'll talk a bit later about what to do when we don't have as much data as we'd like. But if machine learning isn't rooted in data, then it's just machine wild guessing. Our brains have an amazing capacity to recognize patterns and find trends. In fact, some of the hottest research topics of machine learning today are things that a five-year-old can do easily. For example, recognize a stop sign or a truck, understand a sentence, or pick out all of the pictures with dogs in them.
You know, my car shows a visualization of what's around it and recently, that visualization was updated to include things like traffic cones and bicycles. I was really excited about this improvement and I was trying to describe to someone how cool this was, and I realized I was trying to get them excited about my car basically being a toddler. That's about where we are with machine learning right now, the smart toddler stage. So, you have to pick your tasks for machine learning with that in mind. If humans are pretty good at making predictions, then why are we even bothering with machine learning? First, humans just cannot keep up with the flood of data that we're generating today. Even a few sensors generating only one data point per second can quickly overwhelm just about anyone, let alone trying to keep up with this 24/7, 365.
Humans also tend to simultaneously under and overestimate the chances of something happening. If I see there's a 15% chance of something happening, our brains conclude, it's just not gonna happen. But a 15% chance means that on average, one time out of seven, it will happen. That's certainly not all the time, but that's really not a rare event. So, we need to kind of get our brains in the mode of thinking about what probabilities really are. Our brains also over-generalize from bad events. If you've been bitten by a dog, your brain is gonna assume that all dogs are out to get you. That's really helpful for keeping you alive, because you're going to avoid dogs, and the downside perhaps to avoiding dogs is, you don't get to pet the cute puppy, but you don't get bitten either. But that's not the kind of accuracy we want when we're trying to run our operations. We'd like to be able to basically pet that puppy, right?
Humans are also really, really good at seeing trends that are not there and then making connections between events that are not actually connected. So, for example, you can't get the flu from a flu shot, but we coincidently get sick shortly after an unusual event like getting the flu shot and our brains make that connection. Again, that's really useful for keeping you alive. You know, you eat a strange mushroom and you get sick the next day, you're gonna avoid that strange mushroom in the future. But this is not the kind of accuracy that we want when we're trying to run our whole operation.
So, the other thing is, humans get bored doing repetitive tasks. This is gonna lead to inattention and sloppy work. It's just natural, we can't help it, but that's going to cause us to miss rare events, like a defect. This has been a problem when trying to train cars for self-driving, as they start doing some things with humans at the controls and humans just get bored and they get sloppy. Also for things like tagging images and things like that. By the time you've looked at a hundred images in a row, your brain is just not paying attention to details anymore. So, this has caused some problems in machine learning research, because we're trying to use humans as the standard and humans naturally make mistakes and are sloppy, and especially, when it's a very repetitive boring task.
So, now that we know what machine learning is, we'd like to look for a project to start on. So, what qualities should we look for in our very first project? Machine learning thrives on data, especially high quality data. So what do we mean by high quality? Well, we'd like them to be accurate, for one, because if our data isn't accurate, our predictions won't be accurate either. And I kinda talked about that a little bit in the previous slide, where humans can actually mislabel things, and we have to take that into account when we're trying to decide if our data is even accurate.
We'd also like to have very few missing or null values in our data, if possible. Those missing values could have been important outliers. And the time it takes to clean them up or eliminate them can slow down our whole project. So, especially if you're looking again for defects, sometimes if the humans have been trained, for example, that they get bonuses for fewer defects, not all the defects might have been logged. Sometimes the defect can't be logged because the defect is the machine breaking down, and if the machine is doing the logging, it didn't log itself when it broke down.
So we need to think very critically about the missing values in our data and decide whether those missing events are important in themselves or if they can be eliminated or not. The big question though usually is knowing if you have enough data. And unfortunately, that's very hard to measure until you try to make a prediction and test those results. The more noise in your data, the more data you're going to need. We can figure a straight line from just two points if those two points are really accurate, but the real world does not have perfect accuracy. We have a lot of noisy data and it becomes much harder to predict that straight line when there's a lot of noise in the data, and we need a lot more data to be confident in our line.
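The point about noise and data volume can be sketched with a small simulation. This is only an illustration under stated assumptions: the function names, the true line (slope 2, intercept 1), and the noise levels are all made up for the example, and the code sticks to the standard library so that it should also be usable from a Jython script, although that has not been checked inside Ignition.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def average_slope_error(n_points, noise_sd, true_slope=2.0, trials=200, seed=0):
    """Average absolute slope error over many simulated noisy data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Hypothetical process: y = true_slope * x + 1 plus Gaussian noise.
        xs = [rng.uniform(0.0, 10.0) for _ in range(n_points)]
        ys = [true_slope * x + 1.0 + rng.gauss(0.0, noise_sd) for x in xs]
        slope, _ = fit_line(xs, ys)
        total += abs(slope - true_slope)
    return total / trials
```

Comparing `average_slope_error(5, 5.0)` with `average_slope_error(200, 5.0)` shows how much more data a noisy process needs before the fitted line settles down, while with `noise_sd=0.0` even a handful of points recovers the line.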
We'd also like to be able to have a measurable result in a reasonable amount of time. So what does this even mean? Let's say that we're trying to predict the correct setting on a machine for certain conditions. We can measure something about the output of that machine, like, maybe the number of defective items it produces in the next week. Okay, that's fine. But let's say that we're trying to measure time before failure. That's definitely measurable, but if the mean time before failure is typically years, then we're gonna have to wait a really, really long time to know for sure how good our predictions are. It may take a number of years before we have a significant number of failures, and that's not the kind of turnaround time that we certainly want for our first project. We'd like that feedback to be much, much faster. So, pick a project where you can get the feedback in a reasonable amount of time.
We also want to be able to use our prediction to take an action. Whether that action is to change a setting on a machine or do some maintenance task or order a certain amount of raw materials, we need to put our prediction into action before it's valuable to us. I've seen a lot of projects where someone goes through a lot of effort to make a prediction, and then that prediction doesn't let them do anything differently than they would have done anyways. At that spot, the prediction is interesting and it was a good exercise, but it's not of any real use and so it becomes a wasted effort. We'd also like our project to be something that humans are bad at and machine learning is good at, right? Because that way we're gonna get the most results. Usually that means something with a firehose of data coming at us, because remember, humans can only handle so much data at a time. We're kind of limited in our bandwidth.
But I find the real victories could be taking over tasks where humans get bored. Again, when humans are bored, they make a lot of mistakes, but also there's the psychological factor, in that, this is going to free the people up for more interesting tasks, that's gonna make them a fan of your project, and that's going to make it much easier for you to do the next project. You're also gonna get more cooperation from the humans, right? You're not taking the good part of their job, you're taking the awful, boring part of their job, and they will definitely buy into that. So, there's a lot of wins for taking something that humans are bad at and doing that with machine learning. So, now that you have a general idea of the project you want to do, we'll need to do some background work. It's really tempting to just dive in and start playing with algorithms. I sometimes fall into that temptation myself, but I've also learned the hard way that you end up wasting a lot of time and energy if you do that.
You know, my background before computer science was math and so, sitting there and playing with the formulas is like a form of relaxation and fun, but you have some prep work to do first. Okay, so the first thing that you really should do is write down what your business goal is. Say that you're working on a predictive maintenance project. Your business goal might be something like, reduce downtime due to equipment maintenance and breakages. We're not gonna set a numerical goal yet. We just want a general business goal. We wanna identify what is the operational benefit that we're gonna get from this project. This is gonna help keep you on track, and quite frankly, it helps you if you need to suddenly do an elevator pitch of your project. Having a specific goal like this will also help us when we get to the step where we define success, and we'll get to that later. Note that we're trying to reduce total downtime, not just unplanned downtime. A very bad predictive maintenance algorithm would constantly have machines down for planned maintenance in order to meet a goal of zero unplanned downtime.
However, that strategy doesn't really help us out, right? We're gonna have no production either, so that's not really the goal that we wanted to have. We could even make our goal better by changing it to include reducing maintenance and repair costs, as well as downtime or to include an increase in production. Thinking out this goal and writing it down will help make sure we get to where we wanna go. Obviously the first time you do this, it's gonna be kinda hard to write a really good goal. But as you then start to see what your algorithms tend to optimize for, you'll get a little more experience at figuring out, "Okay, what is the goal that I really, really want?" We don't wanna have a goal that is gonna just have our machines constantly down so that we have no unplanned downtime. We need to figure out the real goal that we were after here. Machine learning is all about that data.
We'd like to think we have plenty of good, clean, accurate data that's already in the exact form we want it in. The real world is never like that. The thing is, you need to spend 80% of your time acquiring and cleaning your data, and that's true way more often than not. Even just transferring data from one form to another can lead to pitfalls. I recently worked on a project where the data was already in CSV files, it had already been used in a machine learning project, and I just needed to get it into a database, and everything seemed to go great. Importing a CSV into a database is a known thing, it's very easy to do, except that some of my queries started turning up empty when I didn't expect them to. And it turned out, the CSV files had all of their boolean values as the literal words "True" and "False." The database import was expecting a one or a zero and it didn't see that, so it translated everything to false. This was really easy to fix, but if I hadn't expected some data in a certain query and come up empty, I wouldn't have noticed that.
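A minimal sketch of that kind of cleanup, assuming the CSV has a header row and that the boolean columns are known in advance. The column name `defective` and the accepted spellings are assumptions for illustration, not details from the project described above. The converter raises an error on anything unexpected instead of silently mapping it to false, which is the failure mode in the story.

```python
import csv

TRUE_WORDS = {"true", "t", "yes", "1"}
FALSE_WORDS = {"false", "f", "no", "0"}

def to_bit(text):
    """Map a boolean-like string to 1 or 0; raise on anything unexpected."""
    word = text.strip().lower()
    if word in TRUE_WORDS:
        return 1
    if word in FALSE_WORDS:
        return 0
    raise ValueError("unexpected boolean value: %r" % text)

def clean_rows(lines, bool_columns):
    """Parse CSV lines and convert the named columns to 1/0 integers."""
    cleaned = []
    for row in csv.DictReader(lines):
        for col in bool_columns:
            row[col] = to_bit(row[col])
        cleaned.append(dict(row))
    return cleaned

# Hypothetical sample data standing in for a real file.
sample = ["id,defective", "1,True", "2,False"]
rows = clean_rows(sample, ["defective"])
```

In practice `lines` would be an open file object, and the cleaned rows would then be inserted into the database.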
You really have to look at your data hard and make sure any time you're transferring it from one area to another, that you've gotten that data right and you found all of the pitfalls. And sometimes you can look at the first 10 or 20 lines and think everything is clean and then suddenly at line 1000, something goes south for you. So, definitely have an idea of what your data should look like, and look at more than just the first few lines to see if things are going wrong. At this spot, when I say identify your data issues though, we're not actually doing the data cleaning yet, we're not doing the work, we're just trying to identify what are the issues that we will have when we do the work. So, do we think we have enough data? Again, that's hard to tell until you actually do the project, but sometimes if you only have a few hundred lines you may think, "Okay, this is clearly not enough data. So, if we don't have enough, can we get more? Can we estimate more?" This is very often an issue when you're trying to predict failures or defects, because if you have one defect out of a 1000 items or out of 10,000 items, this is a pretty rare occurrence, right?
And so, if it's one out of 10,000, you can have 100,000 lines of data, but you've only got 10 defects in there, that's not very much. So, you may have to be able to generate some false defects as it were. In other words, make some slightly random changes in your current data to increase the number of defects that you're training your machine on. You have to be a little bit careful when you do that, but you can definitely generate more data with estimated data. We need to think about, is the data already in a form that we can use? Obviously, if it's sitting on a clipboard somewhere, that doesn't help us, we need to get it into a computer, but sometimes the data is in multiple places. It might be already in your Ignition databases, but you might have to get some stuff from an ERP or other places in the whole business will have the data that you need, or maybe they're in another database that you need to get access to, you need to move it to the database that you're using. So we need to look at how much work is going to be involved in getting that data all into one place, and it's a place you can use it.
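One simple way to generate extra defect examples is to copy existing defect rows and jitter their numeric values slightly. The sketch below assumes each row is a list of numeric features; the function name, the 2% relative noise, and the sample values are illustrative choices, not a recommendation from the talk.

```python
import random

def jitter_oversample(rare_rows, n_new, rel_noise=0.02, seed=0):
    """Create n_new synthetic rows by copying random rare rows and
    perturbing each numeric feature by a small relative amount."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(rare_rows)
        # Multiply each value by (1 + small Gaussian noise).
        synthetic.append([v * (1.0 + rng.gauss(0.0, rel_noise)) for v in base])
    return synthetic

# Hypothetical defect rows: [temperature, pressure]
defects = [[70.0, 1.2], [75.0, 1.5]]
extra_defects = jitter_oversample(defects, 20)
```

Part of being careful here is keeping the synthetic rows in the training set only, so the model is still tested against real defects.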
We need to think about the accuracy of our data, and what we can do to improve that accuracy if needed. So again, this is where you're gonna be looking at where the null values are, where the missing values are. Also, just generally, think about the accuracy as it was collected. If your data is coming from people writing down on clipboards, did they transpose numbers, did they forget things? Even sensors are not 100% accurate. You could have had a thermometer that was kind of accurate and then at some spot, it got replaced with a more accurate thermometer. And then at some spot, that thermometer broke and started being inaccurate again. So you can't always assume that automated data collection is 100% accurate, it's often very precise, but accuracy is different than precision. And that's another fault of our brains where we see a very precise number, and we assume that it's accurate and those two things are not at all the same. If your accuracy is not where you need it to be, then what can you do to improve that accuracy?
Again, sometimes that's replacing sensors, sometimes it's other issues, but start thinking about that now at the beginning of your project so that you can make any changes that you need to make early on. And do you need to do any data cleaning? And if so, is it a little or a lot? So by data cleaning, that's things like replacing that word false with a zero and the word true with a one. But sometimes it's other things. You may have a date as a string form, and you need it as some other form, there's various different types of data cleaning that you may need to do, and I'll talk about that a little bit when we get to the Ignition portion of this. So, why are we trying to identify all these issues early on? We need to know how much work we're in for, and we need to know if we think we can get enough data to succeed? If we determine at this spot that there's just no way we can get enough data to be comfortable with this project, then we can find another project to do honestly, because sometimes we just don't have the data. And you want to find that out early, rather than when you get to the end of your project.
We're gonna need to identify our implementation constraints. So we all have implementation constraints. In a way, that's a really good thing, because it helps us narrow down the possible implementations we even have to consider. One of the biggest constraints I usually see is whether you can go off premises with your data. A lot of machine learning algorithms are very CPU intensive, and very memory intensive. When I did my master's project, just one run through training one model took several days and I had to train 121 models. So, [chuckle] the vast majority of my time on my master's project was just letting my computer run 24/7, because I was trying to do it on a local laptop. And it was a pretty high-powered laptop, but it was not the kind of computing power that you can get when you go to some place like AWS or something like that. These cloud services will do that heavy-duty crunching for you, then they'll provide you with an API for making predictions, but you may not be allowed to move that data to the cloud. That's a pretty common thing.
People are pretty protective of their data for good reason, and they don't want it moving off premises. If that's the case, the cloud services are not gonna help you, and you need to be able to go for an in-house implementation. This is one of your implementation constraints. If you're training your model on premises, you need to decide if you want a 100% Ignition solution, or if you want your machine learning to live outside of Ignition. Having the solution completely inside Ignition makes using your predictions in real time really convenient. But unless you're writing your own Python scripts for complex algorithms, the choice of algorithms is going to be limited to what's already in Ignition. Going outside of Ignition gives you more options, but then you need to connect your predictions to something Ignition can use, like a database or some kinda RESTful service that you set up with that outside thing. If you're allowed to move historical data to the cloud, but not live data, you might wanna consider a hybrid solution like AWS Greengrass.
So in that case you train a model in the cloud, then you download that model locally, and use a module from Cirrus Link, one of our strategic partners, to connect Ignition to the model for making predictions. This is a really great way to bring machine learning to the edge-of-network devices that probably don't have the computing power to initially train the model, but edge-of-network is usually where you want to actually use the model, so this is a really convenient solution for that. So if we've gotten this far, we think we're gonna be able to make a successful prediction, we've identified the business goal we want, we've identified any issues with our data, we think that we can get enough good quality data and we've identified our implementation constraints. So we still have to think about how we're actually going to make use of our prediction.
Do we wanna connect our prediction to a tag in Ignition that will then do something? Is this prediction going to be input for a script that's gonna run at regular intervals or based on some event? Is this prediction gonna be written to a database for use with other data at a later point? If we think about how we want and need to use our prediction, that's going to help narrow down some of our choices later on, and it's going to help get us on the right track for how we're gonna implement this machine learning solution. So we also need to identify our definition of a success. Success seems like an obvious concept. You're successful or you're not. But I've seen this trip up so many projects. What does it really mean for our project to be successful? We're unlikely to get 100% accuracy. So what is the level of accuracy that we're going to be happy with? One answer you hear a lot is I want it to be more accurate than what we're currently doing. That's a great goal, but it means you really need to know how accurate you're doing right now.
This comes from how accurate your data is. For example, going back to the defect prediction task. If currently, your defects are being spotted by humans who log manually, how are you able to track the defects that your humans missed because of inattention? How are you able to track the defects that your humans missed because they have an incentive not to track them? Do you have a way to track the items that a human thought was defective, but turned out not to be? What happens if the defects were correctly detected, but incorrectly coded for some reason? So not a malicious missing, but just somebody marked the wrong box accidentally. These kinds of things affect both your definition of success and how your machine learning model even gets trained in the first place. But let's assume our data is okay and we're just trying to define success. There's several things to consider here. One is the cost of false positives. So back to our defect detection project.
A false positive is to say a piece is defective, when it's not. The cost of that is gonna be special handling, possibly discarding a perfectly good piece. Another cost is false negatives. For a defect detection project, that's saying a piece is good when it's not, potentially sending your defective piece out the door. Depending on the piece, that cost may be very minimal or may be very high, especially if you become responsible for problems that piece causes. You send a defective automobile part out, that causes the auto to crash, that can be a big problem. So as we get fewer and fewer false positives, we may be getting more false negatives or vice versa. So you need to definitely take the cost of each kind of error into account. You also need to look at how skewed your population is. If we successfully predict good versus defective items 98% of the time, and we're wrong on 2%, but our overall defect rate is only 1% of items, we would have done twice as well just always predicting that an item was good, because then we have 99%, not 98%.
We'd only be wrong 1% of the time, not wrong 2% of the time. So populations that are really highly skewed like this are more difficult to use to train models because we're looking for a rare event. We may not have much data, and we're also tempted to think our model is doing well when it's not. This sounds really obvious, but I saw this so many times in grad school where a student would get up and very proudly say, "My model's doing awesome. It has this great rate." And then you look at the population and you're like, "No, that's not great. This is worse than just saying always it was category A." So the point is if our predictions aren't making some improvement in our overall process, they aren't of any real use. So in this case, when we're wrong, twice as often as just saying, everything was always good, this is not helpful. We need to make sure that our machine learning project if it's successful is actually doing something positive for us. So I've given you a lot of background material just on machine learning projects in general, and that's really helpful and really necessary.
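The comparison against "always say good" is easy to automate, and so is a rough cost comparison. The sketch below uses made-up numbers: 1,000 items with a 1% defect rate, a model that is wrong on 2% of them, and hypothetical costs of 5 per false positive and 500 per false negative. The label strings and function names are also assumptions for the example.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / float(len(labels))

def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / float(len(labels))

def error_cost(labels, predictions, fp_cost, fn_cost, positive="defect"):
    """Total cost of false positives plus false negatives."""
    fp = sum(1 for y, p in zip(labels, predictions)
             if p == positive and y != positive)
    fn = sum(1 for y, p in zip(labels, predictions)
             if p != positive and y == positive)
    return fp * fp_cost + fn * fn_cost

# Hypothetical population: 990 good items, 10 defective ones.
labels = ["good"] * 990 + ["defect"] * 10
# Hypothetical model: 15 false alarms, 5 missed defects (2% wrong overall).
model = ["defect"] * 15 + ["good"] * 975 + ["good"] * 5 + ["defect"] * 5
always_good = ["good"] * 1000

baseline = majority_baseline_accuracy(labels)
model_accuracy = accuracy(labels, model)
model_cost = error_cost(labels, model, fp_cost=5, fn_cost=500)
baseline_cost = error_cost(labels, always_good, fp_cost=5, fn_cost=500)
```

Checking both numbers matters: accuracy tells you whether the model beats the trivial prediction, and the cost figures tell you which kind of mistake you are actually paying for under your own false positive and false negative costs.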
But you all came here to learn how to combine Ignition with machine learning. So I'm not gonna leave you hanging any longer on this. So you probably already know that Ignition is great at collecting data from your PLCs and other devices, and then storing that data in a consistent usable format into databases. We get so spoiled with this that it's really easy to forget how unusual this is. I've seen so many data collection efforts where chunks of data are suddenly in a different format or the data came from clipboards and it's manually entered into Excel spreadsheets. Being able to do these things through Ignition and how smooth Ignition makes it, it's really, really nice, I have to say. And remember that if data collection is a chore, you can bet it's the last thing people wanna do. So by automating this data collection through Ignition, you're gonna make it way easier to collect the data, than to not collect it. This gives your workers an incentive to do what you need them to do. And that means it's gonna get done.
So when you add a new tag or you have other input into your data stream, it's gonna be really easy to integrate that into your existing data, it's gonna make that so much easier to use. Plus having that data in a consistent format makes your data cleaning tasks either non-existent hopefully, that's rare, but it can happen or at least much, much easier. If you remember a few slides ago, where I talked about my problem with the words true and false always being interpreted as false, I used a script in Ignition to read in the original CSV files, I made the needed changes in my script, and that made the changes in the data, and then inserted the clean data into the database. You won't always have it this easy, cleaning data is usually a lot harder, but being able to easily write and execute those scripts to do it is such a huge help. So taking the time to visualize your data, can also save you a lot of time in the long run. Remember when I said that our machines are already doing machine learning.
Well, our brains are amazing at pattern recognition. And seeing your data on a chart can let you easily recognize a lot of trends, you might see how many clusters your data is in, whether your data is linear or not, how noisy it is. Our brains are just super fine-tuned for this kind of task, so take advantage of it. When you've gotten your data into a database, and it's time series data, then making a chart out of the data is as easy as setting up a couple of database pens on an Easy Chart. If it's not time series data, then the Classic Chart is really great for visualizing. You instantly know if your data fits a nice line, if there's a lot of noise in the data, or if there seems to be no correlation at all between things.
One added bonus is that you're gonna find out if your data needs more cleaning because if you try to chart it and you're getting a bunch of errors, you know that something's not right and you need to figure it out. You'll also see if there's big gaps in your data. When you're confronted with a million lines of data, it's hard to notice that a month's worth of data is missing, but when you throw that onto the Easy Chart, you're gonna see this huge gap and it's gonna be very obvious. I don't even think about choosing an algorithm until I've done this step of visualizing. I may think something's a regression problem, in other words, a problem where I need to predict a value and then I see when I chart it that my data is really in categories. There's big clumps of data and predicting value may not be that useful or it may be that once I know the category that the value was obvious. I might even wind up going back to my planning steps and rethinking how I'm going to use the data once I see it or I may be really sure that a simple linear regression algorithm is gonna work well and when I see the data, I realize it's not that easy.
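As a complement to looking at the chart, gaps can also be listed with a few lines of script. This sketch assumes the timestamps are plain numbers (seconds since some epoch) and that you choose the threshold yourself; the function name and sample values are illustrative.

```python
def find_gaps(timestamps, max_gap_seconds):
    """Return (start, end) pairs where consecutive samples are further
    apart than max_gap_seconds. Timestamps are numeric seconds."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap_seconds:
            gaps.append((earlier, later))
    return gaps

# Hypothetical one-minute samples with a long outage in the middle.
gaps = find_gaps([0, 60, 120, 4000, 4060], 300)
```

A list like this tells you exactly where the holes are once the chart has shown you that something is missing.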
Always, always, always take the time to do the visualization step, you will not regret it. Neural networks and genetic algorithms, you can find some really great examples of how to use these in the machine learning workshop package from ICC 2019 that's available for free on Ignition Exchange. Just go to the website, choose Ignition Exchange, download it. It's gonna be, if you look at the top ones, it's gonna be the first few lines at the top ones. That package is gonna work on any version of Ignition 8 and it has really well-commented scripts that are gonna show you step-by-step how to use some of the common algorithms and make predictions with them. Definitely, play around with that to see what Ignition can do.
If one of these algorithms doesn't do it for you, then you can also use the built-in Python scripting to just write your own. Now, remember that Ignition uses Jython, so you can't directly use popular libraries like NumPy or SciPy, but you can do quite a bit of machine learning if you just roll up your sleeves and have at it. Most machine learning is really linear algebra and Jython can handle that perfectly fine, but of course you can do machine learning outside of Ignition and then pull the results in. If you wanna use a cloud service, most of them have a RESTful web interface for uploading training data, and then later you can upload your inputs for predictions and get the results back. That takes very few lines of scripting in Ignition and you can just access those services and get the data and write to tags and make real-time use of your predictions.
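To give a feel for how far plain lists and loops can go, here is a minimal sketch of linear regression trained by batch gradient descent with no third-party libraries. It is not taken from the workshop package mentioned earlier; the function names, learning rate, and epoch count are assumptions, and although it avoids anything NumPy-specific it has not been run inside Ignition.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

def train_linear(rows, targets, learning_rate=0.05, epochs=5000):
    """Fit weights and bias for y ~ dot(weights, x) + bias
    using batch gradient descent on mean squared error."""
    n_features = len(rows[0])
    n = float(len(rows))
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        # Prediction errors for the current weights and bias.
        errors = [dot(weights, x) + bias - y for x, y in zip(rows, targets)]
        for j in range(n_features):
            grad = sum(e * x[j] for e, x in zip(errors, rows)) / n
            weights[j] -= learning_rate * grad
        bias -= learning_rate * sum(errors) / n
    return weights, bias

def predict(weights, bias, x):
    """Predict a value for one row of features."""
    return dot(weights, x) + bias

# Hypothetical training data generated from y = 2x + 1.
weights, bias = train_linear([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0])
```

The learning rate usually needs tuning to the scale of your features; scaling inputs to a similar range first tends to make this kind of loop converge much more reliably.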
As I mentioned earlier, if you trained your model in the cloud and then are using something like Greengrass to make predictions locally, you can use the Cirrus Link module to connect that Greengrass instance to Ignition. This is, excuse me, a really powerful tool. Most of the time, we need our predictions at the edge-of-network honestly and that's where we don't have the most computing resources. The fact is machine learning models are memory and CPU hogs while they're being trained, but once the model has finished training, the actual prediction-making portion is usually very, very compact and very efficient. It's great at the edge-of-network. Definitely, plan on this when you're doing your project.
Another way to combine machine learning inside and outside of Ignition is to take some mocked-up data and move that to the cloud and take advantage of how easy it is to substitute in different algorithms to see what works best for your type of problem. Once you've narrowed down to an algorithm or two, then you can move back to 100% Ignition solution with your real data. That keeps your real data out of the cloud that may work for the constraints that you have. And then, you can either use the built-in algorithms in Ignition or you could use a custom script implementation, but that takes advantage of the power of how easy it is in the cloud to change your algorithms up, but lets you keep everything, all your actual data in-house.
You may have noticed that I'm not actually recommending any specific algorithms to try here either in machine learning in Ignition or outside of Ignition, and that's gonna be for a couple of reasons. One is even if I knew exactly what type of project you're doing, there's often no single best type of algorithm. I'm usually gonna give you two or three to try. And an example of this is that for image processing, for classifying images, typically neural networks and support vector machines are the algorithms to go to, but my whole master's thesis was on showing that neither one of those really work well for classifying certain types of astronomical images. And since I did that thesis, other people have worked on the same problem and tried other algorithms and come up short too. We actually have not yet found a good algorithm for that type of image processing.
We can have some educated guesses, but a whole lot of it is just gonna be trial and error and seeing what works. We can point you in the right direction, but we can't say definitively, "Always use this." And that's why having the ability to try a lot of different algorithms is really, really helpful. As I said before, unless you actually use your prediction, it's not really worth anything. It's certainly possible to use your prediction outside of Ignition. For example, if you're predicting something like raw materials needed, that needs to be communicated to your purchasing people. Oh, no. Come to think of it, you could have Ignition. Just send them the email, right? But often, we're predicting things like maintenance needed or a setting for a machine or which items are defective. If we're predicting maintenance needed, tie this in with your equipment schedule component in your employee schedules and that'll let you choose the best time for maintenance and communicate it to everybody that something needs to be done and even what needs to be done.
Machine settings are an easy one if they're connected to a PLC, but at the very least you can make a nice display for the operator and let them know what to do. And basically, with every prediction, you should find a way to have Ignition automatically implement it if that's appropriate, or to have Ignition let the humans involved know what that prediction is and what next step they should take. Ignition is so good at that human-machine interface that we should totally take advantage of it. Track, track, track and then track some more. I haven't really talked about testing yet, and since that's related to tracking, I'll talk about both of them right here. The general rule of thumb is to set aside about 30% of your data and use it for testing. If possible, never test with data that you used for training. If you do that, you won't catch a problem called overfitting, where the model hyper-fits itself to your noisy training data and becomes useless for making predictions on new data. If you don't have enough data to do that, there are ways to get around it, but in general, setting aside 30% at random should be your goal. Don't set aside the first 30% or the last 30%; choose it at random.
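The 30% hold-out can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `train_test_split` and the `seed` parameter are invented here, and the rows are stand-in integers rather than real logged data.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=None):
    """Randomly hold out a fraction of the records for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # a seeded generator makes the split repeatable
    shuffled = list(records)   # copy so the caller's data is left untouched
    rng.shuffle(shuffled)      # random order, so neither the first nor last 30% is favored
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set: 100 logged rows, represented here by their row ids
rows = list(range(100))
train, test = train_test_split(rows, test_fraction=0.3, seed=42)
```

Fixing the seed is a design choice for repeatability while you iterate on a model. Leave it as `None` if you want a fresh split each run.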
So you have this data set aside for testing and you're gonna use that data to make a prediction. So why would you use data to make a prediction that you already know the answer for? It's precisely because we already know the answer that we can tell how close our prediction was to the actual value. We can then decide if we're meeting the measure of success that we decided on back in the beginning. We do this using what's called a cost function.
That can be as simple as yes or no if we're trying to predict a category: we got it right or we got it wrong. Or it can be the difference between two numbers if we're trying to predict a value. But remember, we talked about false positives and false negatives, and how one might be more costly to us than the other. That's why we want a higher penalty for the one that's more costly to us. Or we may have a cost function where any value that's kind of close is okay, but after a certain point, we want a really high penalty. So many machine learning algorithms will let you design a custom cost function that's used for training and scoring the model that you build. I would try one of the built-in ones first, but think hard about whether this cost function is really optimizing for the thing you want it to optimize for.
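The two kinds of custom cost function described here might look something like the sketch below. The function names, the default penalties, and the sample values are all assumptions chosen for illustration. They only score predictions after the fact; whether a given algorithm lets you plug a function like this into training depends on that library.

```python
def classification_cost(actual, predicted,
                        false_positive_cost=1.0, false_negative_cost=5.0):
    """Average cost of yes/no predictions, weighting the two error types differently."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for truth, guess in zip(actual, predicted):
        if guess and not truth:
            total += false_positive_cost   # raised a flag that wasn't needed
        elif truth and not guess:
            total += false_negative_cost   # missed a real problem
    return total / len(actual)


def tolerance_cost(actual, predicted, tolerance=2.0, penalty=10.0):
    """Average cost of numeric predictions: free inside the tolerance band, steep outside."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for truth, guess in zip(actual, predicted):
        error = abs(truth - guess)
        if error > tolerance:
            total += penalty * (error - tolerance)
    return total / len(actual)


# Hypothetical results: one missed defect and one false alarm out of four parts
defect_cost = classification_cost([True, True, False, False],
                                  [True, False, True, False])

# Hypothetical setpoint predictions: off by 1, 3, and 10 units
setpoint_cost = tolerance_cost([100.0, 100.0, 100.0], [101.0, 103.0, 90.0])
```

With these made-up weights, a missed defect costs five times as much as a false alarm, so two models with the same raw accuracy can score very differently.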
Okay, so that's the testing portion. What about tracking? Well, tracking is basically ongoing testing, right? Is our model still as successful as we thought it would be on day one? So on day one, we've done our testing, and we have an idea of how successful it will be. If we don't continue to track, then we won't know if conditions are drifting. Maybe there's some variable we didn't know about that has changed over time. Maybe machines are wearing, or maybe our raw materials have changed and we don't realize that things are just different now. If we don't track, we won't know. And honestly, tracking results is something we should be doing anyway. Even without machine learning, we don't know if an operator is doing something wrong, or whatever. But it matters especially with machine learning, because there's less human input and we tend to let things go on autopilot a little bit more. So we definitely need to track those results on an ongoing basis. One nice thing is, if you're tracking your results with Ignition, you can do this automatically, and you can set some kind of alarm point where if your results drift by so much, then Ignition can send you a text or an email and say, "Hey, things are going off the rails, pay attention to this now." It's much better to get that notice sooner rather than later. So take advantage of Ignition's alarming to help you with this.
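One way to picture this kind of ongoing tracking is a small monitor that compares recent prediction error against the error measured on day one. The class name `DriftMonitor`, the 1.5x threshold, and the sample readings below are assumptions for illustration, not anything built into Ignition.

```python
from collections import deque


class DriftMonitor(object):
    """Tracks recent prediction error and flags drift away from the day-one baseline."""

    def __init__(self, baseline_error, allowed_ratio=1.5, window=50):
        self.baseline_error = float(baseline_error)
        self.allowed_ratio = float(allowed_ratio)
        self.errors = deque(maxlen=window)  # keeps only the most recent results

    def record(self, actual, predicted):
        """Store the absolute error for one prediction once the real value is known."""
        self.errors.append(abs(actual - predicted))

    def mean_error(self):
        if not self.errors:
            return 0.0
        return sum(self.errors) / float(len(self.errors))

    def is_drifting(self):
        # Wait for a full window so a couple of bad readings don't trip the alarm
        if len(self.errors) < self.errors.maxlen:
            return False
        return self.mean_error() > self.baseline_error * self.allowed_ratio


# Hypothetical readings as (actual, predicted) pairs
monitor = DriftMonitor(baseline_error=1.0, allowed_ratio=1.5, window=5)
for actual, predicted in [(50.0, 49.0), (51.0, 50.5), (52.0, 51.0),
                          (50.0, 49.5), (49.0, 48.0)]:
    monitor.record(actual, predicted)
healthy = monitor.is_drifting()   # recent error is still near the baseline

for actual, predicted in [(50.0, 47.0), (51.0, 48.0), (52.0, 49.0),
                          (50.0, 47.0), (49.0, 46.0)]:
    monitor.record(actual, predicted)
drifted = monitor.is_drifting()   # recent error is now three times the baseline
```

In an Ignition project, one option would be to write the result of `is_drifting()` to a tag and let an alarm on that tag handle the text or email.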
You're excited about what machine learning can do. You're tempted to start some big shiny project, but like any skill, doing a good machine learning project takes practice and you're gonna be a lot less frustrated and a lot more happy if you start on a small project, honestly. Do the big shiny project as your second or third project. In the university classes that I teach, students constantly ask me for step-by-step guides for cleaning data and I really wish it were that easy, but the way that you learn to gather and clean data is by gathering and cleaning data. Every project's gonna have some different nuance or some new stumbling block and everybody's gonna have a different tool or trick that works best for them. So I can't emphasize this enough, but get comfortable mucking around with data, especially scripting or find someone who's comfortable to help you with this project because as I said, 80% of your time is gonna be mucking around with data. So just get used to it.
Try doing the project both in Ignition and the cloud to get a feel for what the differences are in each environment. Ignition is gonna get you up and running really quickly, especially if you download that machine learning project from the Exchange, but it is harder to try new algorithms. Using the cloud resources is gonna let you experiment a lot very quickly, but there is some cost involved with the cloud, right? They're out to make money and so they're gonna charge you for those things. That charge can be a very small cost if you're doing a project that is really useful to you, but you need to just try it both ways and see what's working better for you. So once you've trained your model, test it with the data you held back and see how you did. If it didn't meet your standards for success, go back as many times as you need to and as many steps as you need to. It's very rare to find a machine learning project that's perfect on the first try. You need to just get in the mindset that this is an iterative process and you can continue to make improvements even after you put it into production. That's just the way machine learning projects are.
And remember, if you have questions on your project, don't be afraid to reach out. One great way is to post on the Inductive Automation forum; there you can not only connect with Inductive Automation employees, but you'll get ideas from our fantastic community of forum members who are using Ignition out in the field. They're just like you, they've run into many of the same issues already, and they're very, very happy to help. And then when you get that project into production, be sure and let us know about it. We'd love to hear about what you've done with machine learning and Ignition.
So I'm gonna open it up to questions. Remember, this is a chat and I really wanna hear what questions you have. Okay, so one question is, “How much do you need to understand algorithms to use them?” You don't need to understand like the ... It's usually a quick Google search for most of the algorithms. So for example, if you're thinking of using a neural net, you need to understand that they're really good at answering yes or no questions, and they are something that can take a lot of inputs.
I would say that you don't have to understand all the ins and outs, you just need a general five-sentence understanding of an algorithm to be able to use it. “Okay, how do I compare the Ignition platform with other platforms, open platforms for machine learning?” I would have to say that Ignition is not primarily a machine learning platform, right? It's primarily SCADA and HMI. So the machine learning things that are built into Ignition are just a few algorithms. It's not the main point of Ignition. So if you're gonna do something really, really heavy duty, I would be looking at solutions outside of Ignition, and then pulling those answers into Ignition, but it's a really good way to get started, and certainly for small projects you can keep it 100% in Ignition.
“Where can I find a demo of Ignition doing something with neural networks?” I would take a look at the project from Ignition Exchange. We might have done a neural network example in there, and if not, let me know and we can probably get one going, but I'm pretty sure there's one in there. So check that out first. Let's see, “I'm new to machine learning, do you have some good resources to get a better idea for how to implement machine learning?” Unfortunately, a lot of the resources are not good for beginners. Unfortunately, there's like a steep learning curve for some of the things. But again, I would take a look at that demo project. I know that I've mentioned it a lot, but it runs you through three really... I think it's three, really simple machine learning tasks, and the scripts are just so well commented that you can see what's going on step by step with the scripts, so you can see what part is the data collection. You can see what part is the actual machine learning. You can see which part is testing, you can see which part is implementing the prediction and that will give you a really good feel for starting out, and that will get you a really good jump-start.
“Do I recommend doing machine learning projects on production servers, if not what architecture do I suggest?” I would never train a machine learning project on a production server, unless you're doing something that requires ongoing training. And the reason is that machine learning is such a memory and CPU hog, that you could inadvertently bring your whole production down to a halt. And you don't wanna do that. So for training your model initially, I would do it on something with a lot of memory. Memory usually is the bigger one than CPU. If you're having database access or a lot of data that you need to get to, obviously you need a fast disc. So I would train it on something that really can handle that kind of data. Once you get the model trained most of the time, then you can move that to your production server because the accessing the model is very low overhead, and then it's fine.
“What exactly does it mean to clean data?” Yeah, so cleaning data has a lot of different meanings to it. So one is just that sometimes your data literally can't be imported the way it is. If you recall my true and false example, that didn't throw an error, but it needed cleaning up before it could be imported into the database. Sometimes things are just misformatted or missing. There's a gap in your data and it means that you can't even import it until it's been cleaned up. Sometimes your algorithm can't handle null values, and you need to get those out of there. So, basically, it's getting the data into a form that you can use and getting rid of erroneous data if you can find it.
For example, you may be logging a process where you know the temperature can never go over 100 degrees centigrade, and there's some data that has 400 degrees centigrade. Obviously, that's bad data that needs to be dealt with in some way. So all those kinds of things are cleaning data. So we have a couple of questions on whether we're going to build more machine learning libraries into Ignition. There are some plans for that. We're kind of taking a look at what is missing out there because we don't wanna duplicate what is already readily available to all of you, but we would like to make it easier to use machine learning in Ignition. So certainly, let us know what projects you're doing and what things would have helped you do those projects better in Ignition. Absolutely let us know. You can send me an email. You can contact our sales engineering team or your sales rep. You can let us know on the forum. There are a lot of ways to get this information to us, but definitely get it to us, because right now we're kind of forming the idea of what improvements need to be made for better using machine learning in Ignition. So we'd really, really like that information from you.
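To make the data-cleaning answer above a bit more concrete, here is a minimal sketch that handles the three cases mentioned: missing values, misformatted values, and readings outside the physically possible range such as the 400-degree one. The function name, the row layout, and the lower bound of 0 degrees are assumptions for illustration.

```python
def clean_readings(rows, min_temp=0.0, max_temp=100.0):
    """Split raw rows into usable readings and rejects.

    Each row is a (timestamp, raw_value) pair; raw_value may be None,
    a number, or a string such as "72.5".
    """
    clean, rejected = [], []
    for timestamp, raw in rows:
        try:
            value = float(raw)  # handles numbers and numeric strings
        except (TypeError, ValueError):
            # None raises TypeError, text like "n/a" raises ValueError
            rejected.append((timestamp, raw, "missing or not a number"))
            continue
        # NaN fails every comparison, so it is rejected here as well
        if not min_temp <= value <= max_temp:
            rejected.append((timestamp, raw, "outside physical range"))
            continue
        clean.append((timestamp, value))
    return clean, rejected


# Hypothetical logged temperatures in degrees centigrade
raw_rows = [
    ("08:00", 72.5),
    ("08:01", None),     # gap in the data
    ("08:02", "73.1"),   # number stored as text
    ("08:03", 400.0),    # impossible for this process
    ("08:04", "n/a"),    # misformatted
]
clean, rejected = clean_readings(raw_rows)
```

Keeping the rejects with a reason, rather than silently dropping them, is a deliberate choice here: it lets you check later whether the bad rows point to a sensor or logging problem.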
“What is the name of the Exchange project?” So, Ignition Exchange is a way that users like you, and also Inductive Automation staff can upload resources for Ignition. So a lot of times they are project resources, but if you go to our website you can see a link for Ignition Exchange. There's a whole resource area where people have uploaded hundreds of different resources to help you. And you can download any of them for free. The resources will tell you what version of Ignition you need, it'll tell you what modules you might need, if any. So it's a ... I would just spend some time looking around at what's on there because there's been an amazing amount of things that are available. And in fact, I might put a link a little later on on the forum, if you're familiar with how to get to the forum, forum.inductiveautomation.com, then you can see how to get to Ignition Exchange because that's a really, really useful resource for people.
“What algorithms do I recommend for time-series data for classification?” I had a whole thing written down on that for somebody a couple weeks ago, but if you ask me on the forum, or you send me an email, then I can get that information to you. Obviously, on the forum then everybody can take advantage of that. “Do I have any examples of applications that have been implemented?” If you go to the 2018 ICC videos, you can see the machine learning session that we did. Kevin, my co-presenter there, showed an example of a solar farm demo, a machine learning project that's been done. And if you go to some of the Firebrand Awards, I believe there's a couple that have used machine learning. Unfortunately, a lot of our end-users don't necessarily want their projects publicly disseminated [chuckle], for obvious reasons, and of course we're not implementing the projects ourselves, so we have a little bit of a limitation in how many we can talk about. But if you have done a project, definitely submit it for the Firebrand Awards, even though ICC will be online this year. Submit it, and I'd love to see what you're doing there.
“Do you know what the common architecture is for a Greengrass Ignition solution? Are you running a Greengrass instance directly on a gateway?” Usually for edge-of-network, the Greengrass instance is gonna be directly on the same computer as the gateway. So obviously you need a big enough computer to handle both of those things. But that's not that hard to do, honestly. You could obviously access it over a network, but generally when you're edge-of-network, you kinda want everything all in one box if you can, so that's generally the way it's gonna be done. “Can you recommend an analytics platform which can reduce the gap of not really knowing what to expect from data?” Unfortunately, no. You have to have some kind of expectation. And again, that's part of why I like visualization, because the visualization can help you with your expectations and at least let you know if your expectations were somewhere in the ballpark. Yeah, [chuckle] I wish. If you find something, let me know, 'cause I'd really love that magic wand, right?
“If you have really large amounts of data, how do you know how to thin it for training or should you use all of it?” Boy, what is really large amounts of data? For my master's project, I had a quarter million records, and it was not enough. But part of it was, again, remember I was doing very skewed categories, so I might have a quarter million records, but I might only have 100 instances of something that I was trying to find. So was that really large amounts of data, or was it not? 100 records was not. But even for the things where I had tens of thousands of instances, it still was not that great. But let's say that you're really looking at Big Data. Remember, the definition for Big Data is more than fits on one machine. Then obviously, yeah, you can thin it out. And there, unless I had some reason to think that some data was more accurate than others, I would just select at random, honestly. Figure out how much you really think you need. If you think 100,000 records is enough, choose 100,000 at random. Don't choose it from the beginning, don't choose it from the end. Random is much better.
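If the data fits in memory, Python's `random.sample(records, 100000)` already does this kind of random thinning. For data too large to load at once, one technique that fits the advice (not one named in the talk) is reservoir sampling, which keeps a uniform random sample while reading the records one at a time. The sketch below is the textbook "Algorithm R"; the function name and the sample sizes are illustrative.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Uniform random sample of k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)  # fill the reservoir with the first k records
        else:
            # Keep this record with probability k / (index + 1),
            # replacing a randomly chosen earlier pick
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


# Hypothetical stream of 10,000 record ids, thinned to 100 at random
stream = (row_id for row_id in range(10000))
thinned = reservoir_sample(stream, k=100, seed=7)
```

Note that this samples every record with equal probability. With very skewed categories like the ones described above, you would likely want to sample the rare category separately so it doesn't vanish from the thinned set.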
“Do you think we can remove the role of SME and ML by running analytics for a COMP motor turbine CMS?” Maybe, I don't know. It'd be worth ... It'd be an interesting project for you to try. So we're gonna wrap this up, and I thank all of you for attending. I'm so happy to have you all here and otherwise, just let us know on the forum or by email if you think of things in the future. And thank you so much for attending today's webinar.