Batch Processing
Complete Developer Podcast - A podcast by BJ Burns and Will Gant - Thursday
Even though most developers prefer to build applications that respond to user input directly, most applications have at least a few processes that happen out of band or even overnight. These processes require a different way of thinking about your code, and about the processes that support that code.

You might use offline batch processing for a variety of purposes within your application: sending emails in bulk, importing large amounts of data from external systems (or pushing data to them), or managing outages and latency in external services. Interactions with unstable, slow, or limited third-party systems are often best handled with a batch process.

In addition, many industries tend to prefer batch processing to processing-as-needed. Banking is one example, and the same is true of many other older and heavily regulated industries. In these industries, batch processing is often used to shift system workloads into time periods where utilization is lower.

At some point in your career, you’ll probably be tasked with writing a batch process to handle off-hours data processing. We hope that the set of suggestions and questions offered here will make it easier for you to successfully build batch processes when that day comes.

Episode Breakdown

Start up

How often and where does this thing run?

The physical execution environment is important, especially if the process has to complete successfully within a short time frame. It’s very common for really crappy companies to try to run batch processes off of developer workstations, or even older computers stuffed in a closet somewhere. This kind of setup tends to degrade slowly over time and then fail suddenly. All it takes is for the cleaning staff to unplug a machine, or for a power surge to hit the building, and you’ll have a mess on your hands.
You also need to think about the digital environment you are running in. Are all your servers available at the time of execution? Are they running intense batch processes themselves? What about the databases you are using? Is maintenance being performed on them during this time window? What about time zone issues? Could your application’s scheduled run time be missed during a shift to daylight saving time? Could a forced system update occur during your startup window?

What resources are available during the run?

How much memory, disk, and network I/O are available? Just because plenty of these resources are available when you are writing the process doesn’t mean that they will be available when it starts up. This could affect how long your process takes to start up, or whether it can start at all.

Sometimes systems are unavailable after hours. Work systems are often turned off entirely outside business hours, especially if you are using a cloud environment. Third-party systems may be under similar constraints. Services you depend on that run on the same system may also be off during this time period.

How do we handle failure to start up?

Just because you schedule a process to run at a certain time doesn’t mean it actually runs. Sometimes the task scheduler chokes and crashes. Permissions can also change, which could stop your program from running.

Your application might start loading and then fail before it actually starts processing. This could result from anything from missing dependencies to simply being unable to contact the system that it uses to determine what work needs to be done.
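
One way to make these startup questions concrete is a small preflight step that runs before any real work begins. The sketch below is only an illustration, not something from the episode: the free-space threshold, the host name `db.internal.example`, the port, and the function names are all hypothetical placeholders. It checks free disk space and whether a dependency accepts TCP connections, then reports every problem it finds instead of stopping at the first one.

```python
import shutil
import socket
import sys
from typing import Callable, List, Optional, Tuple

# Hypothetical values for illustration; replace with what your job really needs.
MIN_FREE_BYTES = 1 * 1024**3  # assume the job needs about 1 GiB of scratch space
DEPENDENCIES: List[Tuple[str, int]] = [("db.internal.example", 5432)]  # made-up host


def check_disk(path: str = ".", min_free: int = MIN_FREE_BYTES) -> Optional[str]:
    """Return a problem description if free space is too low, else None."""
    free = shutil.disk_usage(path).free
    if free < min_free:
        return f"only {free} bytes free at {path!r}, need {min_free}"
    return None


def check_tcp(host: str, port: int, timeout: float = 3.0) -> Optional[str]:
    """Return a problem description if host:port does not accept a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:  # covers DNS failures, refusals, and timeouts
        return f"cannot reach {host}:{port}: {exc}"


def preflight(checks: List[Callable[[], Optional[str]]]) -> List[str]:
    """Run every check and collect failure messages (empty list means OK)."""
    failures: List[str] = []
    for check in checks:
        try:
            problem = check()
        except Exception as exc:  # a crashing check counts as a failure too
            problem = f"{getattr(check, '__name__', 'check')} raised {exc!r}"
        if problem:
            failures.append(problem)
    return failures


def main() -> int:
    """Return 0 if the job may start, 1 if any preflight check failed."""
    checks: List[Callable[[], Optional[str]]] = [check_disk]
    # Bind host and port as defaults so each lambda keeps its own values.
    checks += [lambda h=h, p=p: check_tcp(h, p) for h, p in DEPENDENCIES]
    failures = preflight(checks)
    for failure in failures:
        print(f"PREFLIGHT FAILED: {failure}", file=sys.stderr)
    if failures:
        return 1
    # The real batch work would start here.
    return 0
```

Calling `sys.exit(main())` from the job’s entry point gives the scheduler a non-zero exit code when the job cannot start. Note that this does not cover the case where the scheduler never launches the job at all; catching that requires something outside the process, such as a separate monitor that expects a “started” record by a certain time.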