Michael Li says one of his biggest headaches was locking down his Extract, Transform, and Load (ETL) process. His team at Data Incubator has trained hundreds of data science fellows and has heard, over and over, that implementing their own ETL pipelines is one of the fellows' biggest challenges too. Here are his 3 engineering best practices for making your data analysis reproducible, consistent, and productionisable, so you can focus on the science instead of worrying about data management.
A great article that clearly explains what needs to be done to get data in successfully. I spent part of my working life making sure the analysis and data source side of things was locked down. I admit I never put analysis code under source control, but thinking about it now, I can see there would have been a benefit if I had.