Showing posts with label ETL. Show all posts

Monday, 4 April 2022

Python ETL Pipeline: The Incremental data load Techniques by Haq Nawaz via @Medium

The incremental data load approach in ETL (Extract, Transform and Load) is the ideal design pattern. In this process, we identify and process new and modified rows since the last ETL run.

Code is available on GitHub. I can see that it is picking up just the changes, but I wonder how efficient that actually is for a lot of data, and whether that comparison should be done at the source or somewhere else in the cloud where it can't affect the source's performance. Something to consider.
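The watermark-style approach the post describes can be sketched in a few lines. This is a minimal illustration, not the article's actual code: it assumes each source row carries a `last_modified` timestamp and that we persist the time of the last successful run.

```python
from datetime import datetime, timezone

def incremental_load(source_rows, last_run):
    """Return only rows created or modified since the last ETL run."""
    return [r for r in source_rows if r["last_modified"] > last_run]

# Toy source data: row 2 changed after the previous run, row 1 did not.
rows = [
    {"id": 1, "last_modified": datetime(2022, 3, 1, tzinfo=timezone.utc)},
    {"id": 2, "last_modified": datetime(2022, 4, 1, tzinfo=timezone.utc)},
]
last_run = datetime(2022, 3, 15, tzinfo=timezone.utc)

changed = incremental_load(rows, last_run)
print([r["id"] for r in changed])  # → [2]
```

In a real pipeline the filter would be pushed down to the source query (`WHERE last_modified > :last_run`) rather than done in application memory, which is exactly the performance trade-off raised above.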

Friday, 15 January 2021

Big Data Architecture in Data Processing and Data Access by Stephanie Shen via @DataScienceCtrl

Over the past 20+ years, it has been amazing to see how IT has been evolving to handle the ever-growing amount of data, via technologies including relational OLTP (Online Transactional Processing) database, data warehouse, ETL (Extraction, Transformation and Loading) and OLAP (Online Analytical Processing) reporting, big data and now AI, Cloud and IoT.

This was very clear and insightful. Worth a read as I think it could clear up a few misunderstandings.

Wednesday, 8 July 2020

WEBINAR: ETL & Advanced Machine Learning - Open Source, No Code Required - 16 July 2020

Sponsored News from Data Science Central
Data practitioners create value for their organization through ETL, visualization, machine learning and deployment. Flexible and reliable working tools are as important as the ability to collaborate in-house and with the community. KNIME is an open source platform that covers this entire life cycle; it is free, and easy to install and use.
Join the free webinar on July 16, 1:30 PM
It's happening in two time zones: Americas (CDT) and Asia/Europe (CEST).
Register Now
Join the team of Data Scientists for a quick and practical introduction to KNIME Analytics Platform.

What you will learn in this session:

  • Reduce the time needed to automate ETL.
  • Integrate new data science methods, from simple to sophisticated, such as deep learning, advanced ML, text mining, and time series.
  • Explore a modern, open source data science platform with a visual workflow editor.

Tuesday, 29 January 2019

WEBINAR: Cutting Time, Complexity and Costs from Data Science to Production - 6th February 2019

WEBINAR

Cutting Time, Complexity and Costs from Data Science to Production

One-click (really!) deployment to production without any heavy lifting from data and DevOps engineers
Wednesday, February 6 at 8am PT
Imagine a system where one collects real-time data, develops a machine learning model… Runs analysis and training on powerful GPUs… Clicks on a magic button and then deploys code and ML models to production… All without any heavy lifting from data engineers. Today, data scientists work on laptops with just a subset of data and time is wasted while waiting for data and compute.
It’s about efficient use of time! Join Iguazio and NVIDIA so that you can get home early today! Learn how to speed up data science from development to production:
  • Access to large scale, real-time and operational data without waiting for ETL
  • Run high performance analytics and ML on NVIDIA GPUs (Rapids)
  • Work on a shared, pre-integrated Kubernetes cluster with Jupyter notebook and leading data science tools
Featured Speakers:
Yaron Haviv, CTO, Iguazio
Or Zilberman, Data Scientist, Iguazio
Jacci Cenci, Sr Technical Marketing Engineer, NVIDIA
Register here


Saturday, 22 April 2017

5 Data Management Mistakes to Avoid during Data Integration Projects by @mairabay via @hlsdk

In this article, Canada-based Maira Bay de Souza of Product Data Lake Technologies shares her view on data integration and the mistakes to avoid when doing it.

I agree with her observations but feel that there needs to be more focus on the data that this article describes (although that could be because I spent so many years doing the detailed design for data integrations and loads on a data warehouse).

My thoughts are:

You need detailed documentation of the data at source, target and any processing in between. That documentation should cover formats, values, lookups, defaults, translations, timezones, currencies, master data location/values and anything else you can find.

You need to think about whether you need to handle Slowly Changing Dimensions at all stages of the integration, as they could impact your interface (I don't think they are something that only affects a Data Warehouse).
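For anyone who hasn't met them, the classic Type 2 handling of a Slowly Changing Dimension keeps history by closing the current row and appending a new version. Here is a toy sketch with made-up field names (`start_date`, `end_date`, an open row marked by `end_date is None`), not any particular tool's implementation:

```python
from datetime import date

def scd2_update(dim_rows, incoming, key="id", effective=None):
    """Type 2 SCD: expire the current version of the row and append
    the incoming record as a new open-ended version."""
    effective = effective or date.today()
    out = []
    for row in dim_rows:
        if row[key] == incoming[key] and row["end_date"] is None:
            out.append(dict(row, end_date=effective))  # close current version
        else:
            out.append(row)
    # New version is open-ended until the next change arrives.
    out.append(dict(incoming, start_date=effective, end_date=None))
    return out

dim = [{"id": 7, "city": "Leeds", "start_date": date(2015, 1, 1), "end_date": None}]
dim = scd2_update(dim, {"id": 7, "city": "York"}, effective=date(2017, 4, 22))
# dim now holds two versions: Leeds (closed) and York (open)
```

The point above is that this versioning logic may need to live in the integration layer too, not just inside the warehouse.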

Sunday, 5 February 2017

The Rise of the Data Engineer by @mistercrunch via @freeCodeCamp

This great article shows the transition from business intelligence engineer to data engineer, and it makes so much sense. Definitely a must-read.

On a personal note, I feel I am probably closer to a data engineer than to a data scientist.

Thursday, 25 February 2016

How The ETL Bottleneck Will Kill Your Business via @forbes

How The ETL Bottleneck Will Kill Your Business by Dan Woods ( @danwoodsearly ) via +Forbes  - The landscape of data is growing rapidly. We now have access to new forms of big data, but also many high quality curated data sets from APIs etc. There is a crucial skill that used to go by the name of ETL that is highly undervalued and crucial to making all of this work.

Very interesting article.  Please note it is 4 screens long.

Tuesday, 6 October 2015

Three best practices for building successful data pipelines via @radar @tianhuil

Michael Li says one of his biggest headaches was locking down his Extract, Transform, and Load (ETL) process. His team at Data Incubator has trained hundreds of data science fellows and heard, over and over, that one of their biggest challenges is implementing their own ETL pipelines. Here are his three engineering best practices that can make your data analysis reproducible, consistent, and productionisable, so you can focus on science instead of worrying about data management.

Great article that explains clearly what needs to be done to get data in successfully. I spent part of my working life making sure the analysis and data-source side of things was locked down. I admit I never put analysis code under source control, but thinking about it now I can see there would have been a benefit if I had.

Thursday, 1 October 2015

WEBINAR: The Future of Data Warehousing (ETL Will Never be the Same) - 5 October 2015

The Future of Data Warehousing

ETL Will Never be the Same

Traditional data warehouse ETL has become too slow, too complicated, and too expensive to address the torrent of new data sources and new analytic approaches needed for decision making. The new ETL environment is already looking drastically different.
In this webinar, Ralph Kimball, founder of the Kimball Group, and Manish Vipani, Vice President and Chief Architect of Enterprise Architecture at Kaiser Permanente will describe how this new ETL environment is actually implemented at Kaiser Permanente. They will describe the successes, the unsolved challenges, and their visions of the future for data warehouse ETL.
Register here

Thursday, 3 July 2014

5 steps to offload your Data Warehouse with Hadoop

This +TDWI whitepaper is produced by Syncsort.

It's all about finding the most costly ETL and replacing it with equivalents in MapReduce.
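The "MapReduce equivalent" of a typical ETL aggregation is easy to picture in miniature. This is a toy pure-Python sketch of the map and reduce phases, with invented field names, not the whitepaper's Hadoop code:

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs, like a Hadoop mapper."""
    for r in records:
        yield r["region"], r["sales"]

def reduce_phase(pairs):
    """Group values by key and aggregate, like a Hadoop reducer."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: sum(vs) for k, vs in groups.items()}

records = [
    {"region": "EU", "sales": 10},
    {"region": "US", "sales": 5},
    {"region": "EU", "sales": 7},
]
print(reduce_phase(map_phase(records)))  # → {'EU': 17, 'US': 5}
```

The appeal of offloading is that the same shape of computation runs in parallel across a Hadoop cluster, which is where the cost savings over a warehouse-bound ETL job come from.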