This post begins a PySpark data analysis project on airline on-time performance. It explains why the Apache Parquet data format is useful, using the loading and cleaning of the data as the example.
This is a great post. It includes code, along with a link to the author's GitHub repository where the full code lives.
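For anyone who has not seen the load, clean, and write-to-Parquet workflow before, here is a minimal sketch of what it can look like in PySpark. This is not the author's code: the helper names `snake_case` and `csv_to_parquet`, the header-cleaning rule, and the file paths are all assumptions made up for illustration.

```python
from __future__ import annotations

import re
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pyspark is only needed when csv_to_parquet is actually called
    from pyspark.sql import DataFrame, SparkSession


def snake_case(name: str) -> str:
    """Normalize a raw CSV header, e.g. ' Dep Delay ' becomes 'dep_delay'.

    Hypothetical cleaning rule: runs of non-alphanumeric characters become '_'.
    """
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def csv_to_parquet(spark: SparkSession, csv_path: str, parquet_path: str) -> DataFrame:
    """Read a CSV, apply light cleaning, and persist the result as Parquet."""
    # Inferring the schema costs an extra pass over the CSV; Parquet stores
    # the schema with the data, so later reads do not pay that cost again.
    raw = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Rename every column and drop rows that are entirely null.
    cleaned = raw.toDF(*[snake_case(c) for c in raw.columns]).dropna(how="all")

    cleaned.write.mode("overwrite").parquet(parquet_path)

    # Read back from Parquet so downstream analysis uses the columnar copy.
    return spark.read.parquet(parquet_path)
```

To try it, create a session with `SparkSession.builder.getOrCreate()` and call `csv_to_parquet(spark, "flights.csv", "flights.parquet")`, where both paths are placeholders. Because Parquet is columnar, a later query that selects only a few columns does not need to scan the rest of the file.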