Showing posts with label PYSPARK. Show all posts

Monday, 3 January 2022

A new Era of SPARK and PANDAS Unification by MA Raza, Ph.D. via @AnalyticsVidhya

PySpark and Pandas: A Practical Guide to Spark 3.2.0 and the New Era of Spark and Pandas Unification

I found this very interesting even though I didn't understand all of it.
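The unification the article describes is that, as of Spark 3.2, the pandas API ships inside PySpark itself as `pyspark.pandas`. A minimal sketch of the idea, using plain pandas so it runs on a single machine; the distributed variant (commented) assumes a working Spark installation, and the data is made up for illustration:

```python
import pandas as pd
# With Spark 3.2+, the same dataframe code can run distributed by
# changing only the import (requires a Spark installation):
#   import pyspark.pandas as ps
#   df = ps.DataFrame(...)

df = pd.DataFrame({"carrier": ["AA", "DL", "AA"], "delay": [10, 0, 20]})
# Familiar pandas groupby/aggregate syntax, identical in pyspark.pandas
print(df.groupby("carrier")["delay"].mean())
```

The appeal is that existing pandas knowledge transfers directly; only the import changes when the data outgrows one machine.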

Monday, 24 August 2020

Containerization of PySpark Using Kubernetes by Ajaykumar Baljoshi via @sigmoidInc

This article demonstrates how to run Spark on Kubernetes and includes a brief comparison of the cluster managers available for Spark.

I thought this was a really good article with a great level of detail. If you are interested in doing this in real life, I recommend reading it first: the code snippets will get you ahead of the curve.
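For a rough sense of what running PySpark against Kubernetes involves, here is a minimal configuration sketch. The API server URL, namespace, and container image name are placeholder values I have made up, not taken from the article, and the snippet assumes a reachable Kubernetes cluster with a Spark container image already built:

```python
# Configuration sketch only: connecting a PySpark session to a
# Kubernetes cluster manager (all endpoint/image values are placeholders).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")       # k8s API server
    .appName("pyspark-on-k8s")
    .config("spark.kubernetes.container.image", "my-registry/spark-py:3.2.0")
    .config("spark.kubernetes.namespace", "spark-jobs")
    .config("spark.executor.instances", "2")                   # executor pods
    .getOrCreate()
)
```

With this setup Kubernetes schedules the executors as pods, which is the core idea the article compares against standalone, YARN, and Mesos cluster managers.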

Tuesday, 21 May 2019

WEBINAR: From Pandas To Apache Spark™ - 30 May 2019


Data Science Central Webinar Series Event
From Pandas To Apache Spark™
Join us for the latest DSC Webinar on May 30th, 2019
Register Now!
Presenting Koalas, a new open-source project unveiled by Databricks that brings the simplicity of pandas to the scalability of Apache Spark™.

Data science with Python has exploded in popularity over the past few years and pandas has emerged as the lynchpin of the ecosystem. When data scientists get their hands on a data set, pandas is often the most common exploration tool. It is the ultimate tool for data wrangling and analysis. In fact, pandas’ read_csv is often the very first command students run in their data science journey.
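As a reminder of why `read_csv` is so often the first step, here is the kind of two-line exploration pandas makes possible; the inline CSV is a made-up stand-in for a real file path:

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = io.StringIO("flight,delay\nAA100,12\nDL200,3\n")
df = pd.read_csv(csv_text)   # typically the very first command learners run
print(df.head())
print(len(df))               # 2 rows
```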

The problem? pandas does not scale well to big data. It was designed for small data sets that a single machine can handle. Apache Spark, on the other hand, has emerged as the de facto standard for big data workloads. Today many data scientists use pandas for coursework and small-data tasks. When they work with very large data sets, they either have to migrate their code to PySpark's close but distinct API or downsample their data so that it fits in pandas.

Now with Koalas, data scientists get the best of both worlds and can make the transition from a single machine to a distributed environment without needing to learn a new framework.

In this latest Data Science Central webinar, the developers of Koalas will show you how:
  • Koalas removes the need to decide whether to use pandas or PySpark for a given data set
  • For work that was initially written in pandas for a single machine, Koalas allows data scientists to scale up their code on Spark by simply switching out pandas for Koalas
  • Koalas unlocks big data for more data scientists in an organization since they no longer need to learn PySpark to leverage Spark
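The "switching out pandas for Koalas" point above amounts to changing an import. A minimal sketch using plain pandas, which runs on one machine; the Koalas import (commented) requires a Spark runtime, and the data here is invented for illustration:

```python
import pandas as pd
# On a Spark cluster, swap in Koalas and keep the rest of the code the same:
#   import databricks.koalas as ks   # then use ks.DataFrame(...)

df = pd.DataFrame({"x": [1, 2, 3, 4]})
print(df["x"].sum())  # 10
```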
Speaker:
Tony Liu, Product Manager, Machine Learning -- Databricks
Tim Hunter, Sr. Software Engineer and Technical Lead, Co-Creator of Koalas -- Databricks

Hosted by: Stephanie Glen, Editorial Director -- Data Science Central

Title: From Pandas to Apache Spark™
Date: Thursday, May 30th, 2019
Time: 09:00 AM - 10:00 AM PDT

Space is limited so please register early
Register here

Monday, 5 September 2016

Airline Flight Data Analysis – Part 1 – Data Preparation by Michael Kamprath via @DIYBigData

This is the start of a PySpark data analysis project concerning airline on-time performance. In this post, the usefulness of the Apache Parquet data format is explained as data is loaded and cleaned.

This is a great post; it includes the code and a link to his GitHub repository containing it.