Webinar Preview (Part 2): ETL Techniques for Large Datasets
This is a preview. To access the entire conversation and additional resources, join the CURRENT Community for free and become part of the growing group of leaders advancing the digital supply chain in the electronics industry.
Video Transcript:
Wilmer Companioni:
So you mentioned cost. Right? I guess that’s one of the big trade-offs that you have to make. Right? When you’re balancing, you know, something like cluster size with workload time, those are the two things you’re looking at. So is there anything else besides just those two items?
Dave Antosh:
Cost is a major component. There are some things that by their very nature are very difficult to make parallel, and that is where paying a developer becomes the cost. If something is difficult to make parallel, that is code time spent to make it scale. So those are going to be two ways that it can cost something.
WC:
So the cost can go in one of two places: either in, you know, the size or the number of clusters that you have, or in spending extra resources and cycle time on chunking it into parallel efforts, and that’s obviously a dev effort. Right?
DA:
Yep. And they always come together, but, you know, decreasing one tends to increase the other. We’ve been doing this for a while now, so we can make that trade-off more apparent upfront and have it be one or the other.
WC:
Alright. So a bit of a corollary to that, Dave. Is it fair to say that you should always break down the transform phase into the smallest steps possible? Or is there a drawback to having too much granularity?
DA:
Yeah. That’s a good question. I like to break them into reasonable steps, but if we break them into the smallest possible, each one of those is a potential point of failure and a potential area to debug. Sometimes that’s worth it; sometimes it’s not. That’s another one of those trade-offs. Right now, our steps are either fairly granular or they are larger steps that are batched into pieces: the same step occurring over and over, but with different data driving it.
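The batched-step pattern Dave describes can be sketched roughly as follows. This is a minimal illustration, not CURRENT's actual pipeline: the record shape, the `transform` step, and the batch size are all hypothetical.

```python
from typing import Iterator, List


def batches(records: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size chunks of the input records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def transform(record: dict) -> dict:
    """One granular transform step (hypothetical: normalize a part number)."""
    return {**record, "part_number": record["part_number"].strip().upper()}


def run_step(records: List[dict], batch_size: int = 2) -> List[dict]:
    """Run the same step over every batch; only the data driving it differs."""
    out: List[dict] = []
    for batch in batches(records, batch_size):
        # One code path for all batches means one place to test and debug,
        # rather than a separate tiny step per slice of data.
        out.extend(transform(r) for r in batch)
    return out
```

The point of the pattern is that granularity comes from the data (each batch is a small, restartable unit of work) rather than from multiplying the number of distinct steps, which keeps the number of failure points to debug low.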
WC:
Okay.
DA:
And that kinda gives us a best of both worlds.
WC:
Okay. So Dave, as far as I know, you’re human, and legend has it you’ve made some mistakes in your career. Can you share an occasion where you got stumped by something simple, like typecasting or a type mismatch, that you found after hours of frustration? And how can somebody avoid that?
DA:
Yeah. So we do a lot with automated testing. I’m a big fan of test-driven development and behavior-driven development; I’ve been doing them for a long time. And even with that, sometimes things get through. So recently, I missed a NOT in an IF statement, and it was for an edge case, so it didn’t happen often. But the error occurred after loading about two and a half years’ worth of data, which took a week. And finding and tracking that data through billions of documents was not a lot of fun. So the best way to avoid it is developing good dev practices, good testing practices. Those are the things everybody goes through when they’re in school or reading up on development, and they are good things to use, I will say, in a pragmatic fashion. Avoid the strictest dogma, but maybe take it a little stricter than you’re originally comfortable with, to push yourself in the right direction when developing a sophisticated piece of software.
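To illustrate the kind of bug Dave describes, here is a hypothetical sketch of a missing `not` in an edge-case branch, together with the small unit tests that would catch it before a week-long load. The function name, the `validated` flag, and the tests are invented for the example, not taken from CURRENT's codebase.

```python
def should_reprocess(record: dict) -> bool:
    """Decide whether a record needs to go through the transform again."""
    # A buggy version might read `return record.get("validated", False)`:
    # dropping the `not` silently inverts the decision for every record.
    return not record.get("validated", False)


def test_validated_record_is_skipped():
    # The common case: already-validated records must not be reprocessed.
    assert should_reprocess({"validated": True}) is False


def test_unvalidated_edge_case_is_reprocessed():
    # The edge case: records with no flag at all appear rarely, which is
    # exactly why an inverted condition can hide for years of loaded data.
    assert should_reprocess({}) is True
```

Writing the edge-case test up front, in the TDD spirit Dave mentions, is what turns a week of loading plus a search through billions of documents into a failing assertion that takes milliseconds.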
WC:
Yeah. So you’re saying something like, you know, testing your transformer, your importer, with a small chunk of the larger dataset before you go in and send the full thing through. Right?
DA:
Yep. And we do it at all levels. We have unit tests, integration tests, and functional tests that do it with varying levels of granularity.
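The "small chunk first" idea can be sketched as a cheap smoke test: run the transform over a handful of records and fail fast before committing to the full dataset. The transform, the field names, and the invariants checked are all hypothetical.

```python
from typing import Callable, Iterable


def smoke_test(step: Callable[[dict], dict], sample: Iterable[dict]) -> None:
    """Run the transform on a small sample and sanity-check its output."""
    for record in sample:
        out = step(record)
        # Cheap invariants catch type mismatches and missing fields early,
        # before they surface days into a full load.
        assert isinstance(out.get("qty"), int), f"bad qty in {out!r}"
        assert out.get("part_number"), f"missing part_number in {out!r}"


def transform(record: dict) -> dict:
    """Hypothetical step: cast the quantity field from string to int."""
    return {**record, "qty": int(record["qty"])}


sample = [
    {"part_number": "R-100", "qty": "5"},
    {"part_number": "C-220", "qty": "12"},
]
smoke_test(transform, sample)  # raises immediately if the step is broken
```

The same checks can then be layered the way Dave describes: as unit tests on the step itself, integration tests on a small end-to-end slice, and functional tests against a representative sample of production data.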