Webinar Preview (Part 1): ETL Techniques for Large Datasets
This is a preview. To access the entire conversation and additional resources, join the CURRENT Community for free and become part of the growing group of leaders advancing the digital supply chain in the electronics industry.
Video Transcript:
WILMER COMPANIONI (0:04):
Hello. My name is Wilmer Companioni, Director of Business Development here at Orbweaver, and welcome to our webinar series. I’m here with Dave Antosh, our Chief Systems Architect. How you doing, Dave?
DAVE ANTOSH (0:16):
Not bad. How’s it going?
WC (0:17):
Pretty good. Alright, Dave. So, the topic of today’s webinar is ETL techniques for large data sets, something that I know you’re pretty intimately familiar with. And Orbweaver’s position as a leader in sales, procurement, and data automation solutions is built in part on our experience in the electronics industry and on experts like you, Dave, who tackle our industry’s data challenges head-on almost daily.
So, Dave, are you ready to get going? I’ll throw you a few easy questions to start with.
DA (0:51):
Yep.
WC (0:53):
Alright, Dave. So, as you know, our industry is one that deals with data that is both deep and wide. And what makes that challenging, from an import perspective, are some of the things that you’ve dealt with. So what are some of the ways that you’ve overcome those challenges?
DA (1:12):
Well, one of the things with large data sets that makes it so difficult is you can’t spot check all of the data.
So, when we run an import of many millions of parts and miss a single part, it’s not likely to be found for weeks or months. So, first and foremost, preventing that from happening and putting practices in place where you can address those problems was quite a challenge. Reproducibility with Kafka, our own systems for rerunning transforms, and our views into them are big parts of that solution.
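To make the rerunning-a-transform idea concrete: because Kafka retains the raw messages, a missed slice of an import can be replayed by seeking a fresh consumer back to a point in time and feeding those messages through the transform again. The following is only a minimal sketch of that pattern, not Orbweaver’s actual tooling; the broker address, the raw-parts topic, the replay window, and the reprocess() entry point are all illustrative assumptions.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class ReplayParts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "parts-replay"); // separate group so live consumers are untouched
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");

        // Replay everything received in the last 7 days (must fall inside the topic's retention).
        long replayFrom = System.currentTimeMillis() - Duration.ofDays(7).toMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign every partition of the raw topic explicitly instead of joining the live group.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("raw-parts")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            consumer.assign(partitions);

            // Ask the broker for the earliest offset at or after the replay timestamp, per partition.
            Map<TopicPartition, Long> query = new HashMap<>();
            for (TopicPartition tp : partitions) {
                query.put(tp, replayFrom);
            }
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                }
            }

            // Feed the retained raw messages back through the transform.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(2));
                if (records.isEmpty()) {
                    break; // good enough for a one-off replay
                }
                for (ConsumerRecord<String, String> record : records) {
                    reprocess(record.key(), record.value()); // hypothetical transform entry point
                }
            }
        }
    }

    static void reprocess(String partNumber, String rawPayload) {
        // Placeholder: in a real pipeline this would run the same transform the original import ran.
        System.out.printf("re-transforming %s%n", partNumber);
    }
}
```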
WC (1:53):
Can you elaborate on some of those checks that you gotta have in place in order to have a good sense of the data quality?
DA (2:01):
Sure.
There are automated spot checks in code on common data formats, common data patterns.
When we get into the specifics, we do a lot of QA with full data sets. We smoke test; like I said, there are too many to test them all individually. And then we have tracking patterns for watching for successes and errors as data goes through the system. And with something like Kafka, where there is configurable data retention, we’re able to spot problems, usually before they occur, and address them if they do.
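As an illustration of the kind of automated spot check described above, a lightweight validator might assert common formats on each row and count how many rows get flagged as the batch moves through the system. The field names, regexes, and sample data below are assumptions for the sketch, not Orbweaver’s actual rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PartSpotCheck {
    // Common-format patterns: an alphanumeric manufacturer part number and a simple decimal price.
    private static final Pattern MPN = Pattern.compile("^[A-Z0-9][A-Z0-9\\-./]{2,39}$");
    private static final Pattern PRICE = Pattern.compile("^\\d+(\\.\\d{1,6})?$");

    record PartRow(String mpn, String manufacturer, String price, String quantity) {}

    // Returns a list of problems found for one row; empty means the row passed the spot check.
    static List<String> check(PartRow row) {
        List<String> errors = new ArrayList<>();
        if (row.mpn() == null || !MPN.matcher(row.mpn()).matches())
            errors.add("bad MPN: " + row.mpn());
        if (row.manufacturer() == null || row.manufacturer().isBlank())
            errors.add("missing manufacturer");
        if (row.price() == null || !PRICE.matcher(row.price()).matches())
            errors.add("bad price: " + row.price());
        try {
            if (Long.parseLong(row.quantity()) < 0) errors.add("negative quantity");
        } catch (NumberFormatException e) {
            errors.add("bad quantity: " + row.quantity());
        }
        return errors;
    }

    public static void main(String[] args) {
        List<PartRow> batch = List.of(
                new PartRow("GRM188R71C104KA01D", "Murata", "0.0125", "4000"),
                new PartRow("???", "", "-1", "ten"));
        long flagged = batch.stream().filter(r -> !check(r).isEmpty()).count();
        // Track success vs. error counts so a bad feed surfaces quickly instead of weeks later.
        System.out.printf("checked %d rows, %d flagged%n", batch.size(), flagged);
    }
}
```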
WC (2:43):
Alright. So, as you know, Dave, a bunch of the world’s data, or most of the world’s data, is unstructured.
Does the structured or unstructured nature of the data affect your approach to handling data quality?
DA (3:02):
Not really. The way I like to put it is: while everybody’s speaking the same language, everybody’s speaking a different dialect.
So, we’re able to translate between those dialects. What might appear unstructured to one person actually has an internal structure, and we’re all saying the same things. So, we’re able to figure out what that means, turn it into our own internal structure as a source of truth, and work with that data.
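A small sketch of that “same language, different dialect” translation: two feeds that name the same fields differently are normalized into one internal, canonical record that downstream code treats as the source of truth. The supplier field names and the CanonicalPart shape are illustrative assumptions.

```java
import java.util.Map;

public class DialectTranslator {
    // The internal structure everything is translated into.
    record CanonicalPart(String mpn, String manufacturer, long stock) {}

    // Dialect A: snake_case field names.
    static CanonicalPart fromSupplierA(Map<String, String> row) {
        return new CanonicalPart(row.get("mfr_part_number"), row.get("mfr_name"),
                Long.parseLong(row.get("qty_on_hand")));
    }

    // Dialect B: different labels for the same concepts.
    static CanonicalPart fromSupplierB(Map<String, String> row) {
        return new CanonicalPart(row.get("MPN"), row.get("Brand"),
                Long.parseLong(row.get("Stock")));
    }

    public static void main(String[] args) {
        CanonicalPart a = fromSupplierA(Map.of(
                "mfr_part_number", "RC0603FR-0710KL", "mfr_name", "Yageo", "qty_on_hand", "125000"));
        CanonicalPart b = fromSupplierB(Map.of(
                "MPN", "RC0603FR-0710KL", "Brand", "Yageo", "Stock", "98000"));
        // Once both dialects land in the same canonical form, downstream code works from a single source of truth.
        System.out.println(a);
        System.out.println(b);
    }
}
```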
WC (3:34):
Alright.
So, as you can imagine, Dave, ideally we want both speed and scale in our designs, and you’ve actually been pretty successful in getting both. And I know it’s not easy to get both. Can you highlight some of the trade-offs that you have to make between one and the other, and what you can do to get both speed and scale?
DA (4:02):
Yeah.
A message-oriented architecture is a huge help. We can see how many messages are there to be processed, and we can scale based on that.
We can scale as high as you want. The issue is how much that costs; compute resources aren’t free. So that’s the major trade-off. The way you can squeeze the most out of it is with good metrics and good profiling: making sure you’re not reserving more than you need and, by the same token, making sure you’re reserving as much as you do need, because something failing with an out-of-memory error on step nine hundred thousand of a million isn’t a lot of fun.
The major trade-off is cost.
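As an illustration of scaling on the number of messages waiting to be processed, a scaler can read the backlog (end offsets minus committed offsets) for the transform consumer group and derive a capped worker count from it. This is only a minimal sketch; the topic, group ID, per-worker throughput, and cap are assumptions, not Orbweaver’s configuration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class BacklogScaler {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "transform-workers");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Set<TopicPartition> partitions = new HashSet<>();
            for (PartitionInfo p : consumer.partitionsFor("raw-parts")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }

            // Backlog = newest offset on each partition minus what the group has committed so far.
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);

            long backlog = 0;
            for (TopicPartition tp : partitions) {
                long done = committed.get(tp) == null ? 0 : committed.get(tp).offset();
                backlog += end.get(tp) - done;
            }

            // Assume one worker comfortably handles ~50k messages per scaling interval,
            // and cap the fleet so a huge import can't run the compute bill up unbounded.
            long desired = Math.min(20, Math.max(1, backlog / 50_000));
            System.out.printf("backlog=%d -> desired workers=%d%n", backlog, desired);
        }
    }
}
```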
Dev time is not really an issue with the way we have our patterns going forward now, borrowing a lot from functional programming.
And by keeping pure functions separate from side effects, we’re able to move data through the system in a deterministic way.
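A minimal sketch of that pure-versus-side-effects split: the transform itself is a plain function of its input, so rerunning it over the same messages always produces the same output, and the only impure step is publishing the result at the edge. The record shapes and names here are illustrative assumptions, not Orbweaver’s actual code.

```java
import java.util.List;
import java.util.function.Function;

public class PureTransform {
    record RawPart(String mpn, String priceText) {}
    record CleanPart(String mpn, double price) {}

    // Pure: no I/O, no shared state; the same input always yields the same output.
    static final Function<RawPart, CleanPart> normalize =
            raw -> new CleanPart(raw.mpn().trim().toUpperCase(),
                                 Double.parseDouble(raw.priceText().trim()));

    // Impure edge: the one place the result leaves the system (a Kafka producer, a database write, etc.).
    static void publish(CleanPart part) {
        System.out.println("publishing " + part);
    }

    public static void main(String[] args) {
        List<RawPart> batch = List.of(
                new RawPart(" grm188r71c104ka01d ", " 0.0125 "),
                new RawPart("rc0603fr-0710kl", "0.002"));
        // Deterministic core first, side effect last.
        batch.stream().map(normalize).forEach(PureTransform::publish);
    }
}
```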