Showing posts with label SPARK. Show all posts

Monday, 3 January 2022

A new Era of SPARK and PANDAS Unification by MA Raza, Ph.D. via @AnalyticsVidhya

PySpark and pandas, a practical guide: Spark 3.2.0 brings a new era of Spark and pandas unification.

I found this very interesting even though I didn't understand all of it.
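The unification the article covers is the pandas API on Spark, which ships in Spark 3.2.0 as `pyspark.pandas`: the same DataFrame code runs distributed on Spark or locally on pandas, with only the import changing. A minimal sketch of the idea (the fallback import is my addition so the snippet also runs on a machine without Spark installed):

```python
# Spark 3.2.0 ships the pandas API on Spark as `pyspark.pandas`.
# Only the import differs between distributed and single-machine runs.
try:
    import pyspark.pandas as pd  # distributed execution on a Spark cluster
except ImportError:
    import pandas as pd          # single-machine pandas, identical API here

df = pd.DataFrame({"team": ["a", "a", "b"], "score": [10, 20, 30]})
totals = df.groupby("team")["score"].sum()
print(int(totals["a"]))  # 30
```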

Friday, 31 July 2020

10 big data blunders businesses should avoid by Sara Brown via @MITSloan

Big data is a promising investment for firms, but embracing data can also bring confusion and potential minefields: everything from where companies should spend money to how they should staff their data teams.

This was an interesting read and definitely a good checklist of the mistakes to avoid.

Tuesday, 21 May 2019

WEBINAR: From Pandas To Apache Spark™ - 30 May 2019


Data Science Central Webinar Series Event
From Pandas To Apache Spark™
Join us for the latest DSC Webinar on May 30th, 2019
Register Now!
Databricks
Presenting Koalas, a new open source project unveiled by Databricks, that brings the simplicity of pandas to the scalability powers of Apache Spark™.

Data science with Python has exploded in popularity over the past few years and pandas has emerged as the lynchpin of the ecosystem. When data scientists get their hands on a data set, pandas is often the most common exploration tool. It is the ultimate tool for data wrangling and analysis. In fact, pandas’ read_csv is often the very first command students run in their data science journey.

The problem? pandas does not scale well to big data. It was designed for small data sets that a single machine can handle. On the other hand, Apache Spark has emerged as the de facto standard for big data workloads. Today many data scientists use pandas for coursework and small data tasks. When they work with very large data sets, they either have to migrate their code to PySpark's close but distinct API or downsample their data so that it fits in pandas.

Now with Koalas, data scientists get the best of both worlds and can make the transition from a single machine to a distributed environment without needing to learn a new framework.

In this latest Data Science Central webinar, the developers of Koalas will show you how:
  • Koalas removes the need to decide whether to use pandas or PySpark for a given data set
  • For work that was initially written in pandas for a single machine, Koalas allows data scientists to scale up their code on Spark by simply switching out pandas for Koalas
  • Koalas unlocks big data for more data scientists in an organization since they no longer need to learn PySpark to leverage Spark
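The "switching out pandas for Koalas" claim above can be sketched like this. The snippet itself runs on plain pandas; the commented import shows the single-line change Koalas proposes (assuming, as the webinar says, that Koalas mirrors the pandas calls used here):

```python
import pandas as pd
# To scale this exact workflow out on Spark, the webinar's claim is that
# only the import changes:
#   import databricks.koalas as pd

# A small stand-in for a loaded CSV: drop nulls, derive a column, aggregate.
df = pd.DataFrame({"flight": ["AA1", "AA2", "DL9", None],
                   "delay": [12.0, 3.0, 45.0, 7.0]})
df = df.dropna(subset=["flight"])      # drop rows with no flight id
df["carrier"] = df["flight"].str[:2]   # derive the carrier code
avg_delay = df.groupby("carrier")["delay"].mean()
print(avg_delay["AA"], avg_delay["DL"])  # 7.5 45.0
```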
Speaker:
Tony Liu, Product Manager, Machine Learning -- Databricks
Tim Hunter, Sr. Software Engineer and Technical Lead, Co-Creator of Koalas -- Databricks

Hosted by: Stephanie Glen, Editorial Director -- Data Science Central

Title: From Pandas to Apache Spark™
Date: Thursday, May 30th, 2019
Time: 09:00 AM - 10:00 AM PDT

Space is limited, so please register early
Register here

Monday, 12 November 2018

WEBINAR: Scaling Big Data Pipelines in Apache Spark, No Coding Required - 15 November 2018


Various companies across multiple industries collect and house vast amounts of data. However, most face the same challenge: the ability to process big data and quickly find insight within its framework. Introducing KnowledgeSTUDIO with Apache Spark, the ultimate solution for both data scientists and data analysts. The graphical user interface with Big Data capabilities allows organizations to build pipelines seamlessly.
Join us and learn how users of KnowledgeSTUDIO for Apache Spark, a wizard-driven productivity tool for building Spark workflows, have overcome these challenges.

Learn how data science teams can: 
  • Utilise interactive workflows with an automated design canvas for building, displaying, refreshing, and reusing analytic models
  • Automatically generate code that can be customised and incorporated into production scripts
  • Include manually written code within the graphical workflow
  • Leverage advanced modelling with open source packages such as Spark ML and Spark SQL
  • Avoid overhead costs of parallelisation when datasets are very small
  • Build, explore data segments, and discover relationships using patented Decision Tree technology
REGISTER NOW

Tuesday, 15 May 2018

WEBINAR: An Expert’s Guide to Apache Spark - 23 May 2018

Apache Spark™ has become the de-facto data processing and AI engine in enterprises today due to its speed, ease of use, and sophisticated analytics. As the first Unified Analytics engine to unify data with AI, Spark allows data engineering and data science teams to simplify data preparation and model training — enabling innovative AI use cases that leverage advanced analytics like machine learning, graph analytics, and deep learning.

Join Bill Chambers, author of the book "Spark: The Definitive Guide", and Matei Zaharia, Chief Technologist and Co-founder of Databricks and the original creator of Apache Spark™, in this Data Science Central webinar as they break down the basic operations and common functions of Spark and walk through sample use cases where Spark has helped accelerate AI innovation.

In this webinar, we will cover:
  • A gentle overview of big data and Spark
  • Expert guidance on how to use, deploy and maintain Spark
  • The fundamentals of monitoring, tuning, and debugging Spark
  • An exploration into machine learning techniques and scenarios for employing MLlib, Spark’s scalable machine-learning library
Speakers:
Bill Chambers, Product Manager -- Databricks
Matei Zaharia, Co-founder and Chief Technologist -- Databricks

Hosted by: Bill Vorhies, Editorial Director -- Data Science Central

Title: An Expert’s Guide to Apache Spark™
Date: Wednesday, May 23rd, 2018
Time: 09:00 AM - 10:00 AM PDT
Register here

Wednesday, 24 January 2018

WEBINAR: Matei Zaharia’s Predictions for 2018: Big Data and AI Highlights - 31 Jan 2018

Overview
Title: Matei Zaharia’s Predictions for 2018: Big Data and AI Highlights
Date: Wednesday, January 31, 2018
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour
Summary
Matei Zaharia’s Predictions for 2018: Big Data and AI Highlights
Over the past few years, AI and big data have powered numerous technologies that have changed the way we live, from autonomous cars to conversational systems to personalization. As a result, the excitement around these technologies has spiked. But how can we separate the hype from reality, and which advances will make an impact in practice next?
In this DSC webinar, Databricks co-founder and Stanford computer science professor Matei Zaharia, who started the Apache Spark project in 2009, will share his perspective on which big data and AI trends will come to fruition in 2018. He will discuss how centering organizations around high-quality data will be the main driver to AI, which AI applications are seeing broad success in practice, and how new technologies including deep learning, data marketplaces and cloud computing will affect the computing landscape.
Join this webinar to learn about:
  • The current state of big data and AI
  • Some of the new innovations taking place in research
  • Key challenges that companies face in getting value from data and AI
  • Matei’s predictions for 2018 for how companies and the technology industry will overcome these challenges
Speaker: Matei Zaharia, Co-founder and Chief Technologist -- Databricks
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central
Register here

Tuesday, 14 November 2017

Top Big Data Skills To Help You Stand Out from the Crowd by Sarah Shannon via SmartDataCollective

Big Data is the latest buzzword hitting the technology sector, with data analytics fast becoming the newest technique implemented by businesses to monitor their IT networks and stop impending threats.

Definitely something to read to work out which skills you might be missing, or which would add to your offering if you worked on them.

Tuesday, 24 October 2017

Beyond Hadoop by James Ovendon via @iegroup

A company once synonymous with big data is on its way out, but what comes next?

Interesting. So people are starting to use alternatives to Hadoop, or to use it for other purposes.

Thursday, 21 September 2017

WEBINAR: Deep Learning on Apache® Spark™ - Best Practices - 27 Sept 2017


Overview
Title: Deep Learning on Apache® Spark™ - Best Practices
Date: Wednesday, September 27, 2017
Time: 09:00 AM Pacific Daylight Time
Duration: 1 hour
Summary
Deep Learning on Apache® Spark™ - Best Practices
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Unified Analytics Platform, will present some best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimisations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including:
  • Optimising cluster setup
  • Configuring the cluster
  • Ingesting data
  • Monitoring long-running jobs 
Speaker: Tim Hunter, Software Engineer -- Databricks Inc.
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central
Register here

Thursday, 14 September 2017

277 Data Science Key Terms, Explained by Matthew Mayo via @kdnuggets

This is a collection of 277 data science key terms, explained with a no-nonsense, concise approach. Read on to find terminology related to Big Data, machine learning, natural language processing, descriptive statistics, and much more.

This links to lots of articles grouping the terms by their general classification, for example deep learning or predictive analytics.

Tuesday, 9 May 2017

WEBINAR: Apache® Spark™ MLlib 2.x: Productionize your Machine Learning Models - 16 May 2017



Overview
Title: Apache® Spark™ MLlib 2.x: Productionize your Machine Learning Models
Date: Tuesday, May 16, 2017
Time: 09:00 AM Pacific Daylight Time
Duration: 1 hour
Summary
Apache® Spark MLlib 2.x: Productionize your Machine Learning Models
Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes, how do I deploy these models to a production environment? How do I embed what I have learned into customer facing data applications?

In this latest Data Science Central webinar, we will discuss:
  • Best practices on how customers productionise machine learning models
  • Case studies with actual customers 
  • Live tutorials of a few example architectures and code in Python, Scala, Java and SQL
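The train-once, persist, serve-elsewhere pattern behind "productionizing" a model can be sketched with nothing but the standard library. The MLlib analogues named in the comments are Spark's persistence API; the toy model here is purely illustrative:

```python
# Sketch of the productionization pattern: fit a model, write it out,
# then load it back as a serving process would. In Spark MLlib the
# analogous calls are model.save(path) and PipelineModel.load(path).
import os
import pickle
import tempfile

class MeanModel:
    """A trivial stand-in "model": predicts the training mean."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def predict(self):
        return self.mean

model = MeanModel().fit([2, 4, 6])

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:   # "productionize": persist the fitted model
    pickle.dump(model, f)
with open(path, "rb") as f:   # a serving process reloads it later
    served = pickle.load(f)

print(served.predict())  # 4.0
```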
Speaker: Richard Garris, Principal Solutions Architect -- Databricks Inc.
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central
Join here

Monday, 20 March 2017

IBM Machine Learning brings Spark to the mainframe by @andrewbrust via @ZDNet

IBM has announced support for machine learning on Z-series mainframes. If a lot of your transactional processing is still happening on a mainframe and you want to build predictive models on their data it's definitely worth a look.

This is great news and I think can deliver big results for IBM and their customers.

Wednesday, 8 February 2017

SLIDESHOW: Top 10 Big Data Trends We’ll See in 2017 by Joe Caserta via @infomgmt

Last year was the year of ‘big data.’ This will be the year of ‘data intelligence,’ as organizations look for actionable insights from all that data. Here are 10 trends to expect.

Interesting list.

Tuesday, 7 February 2017

WEBINAR: How to Keep Your R Code Simple While Tackling Big Datasets - 14th February 2017


Overview
Title: How to Keep Your R Code Simple While Tackling Big Datasets
Date: Tuesday, February 14, 2017
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour
Summary
How to Keep Your R Code Simple While Tackling Big Datasets
R, TERR, Spark and Python are tools that benefit from larger systems. Software-Defined Servers enable data scientists to size their processing system to the size of a particular data problem. In this Data Science Central webinar you will learn how Software-Defined Servers work in practice for several common data science tools and will explore how removing core and memory constraints has multiple, profound and positive implications for application developers tackling big data problems of all kinds.
Speaker: Michael Berman, Vice President of Engineering -- TidalScale 
Hosted by: Bill Vorhies, Editorial Director --  Data Science Central

Register here

Thursday, 24 November 2016

WEBINAR: The DNA of a Data Science Rock Star - 29 November 2016


Overview
Title: The DNA of a Data Science Rock Star
Date: Tuesday, November 29, 2016
Time: 09:00 AM Pacific Standard Time
Duration: 1 hour
Summary
The DNA of a Data Science Rock Star
Data Scientists are tasked with transforming their organizations with data. Yet many are struggling to realize their true Rock Star potential, and organizations are missing out on what these Rock Stars could do with the right environment.
Join us for this latest Data Science Central Webinar and learn what skills, tools, and behaviors are emerging as the DNA of the Rock Star Data Scientist. We will explore best practices for Big Data Analytics through open source technologies (e.g. Apache Spark, R, RStudio, Python, Jupyter), techniques including machine learning, and behaviors around collaboration, sharing and learning.
Speakers:
Carlo Appugliese, Hadoop & Spark Evangelist -- IBM Analytics 
Greg Filla, Associate Offering Manager, Data Science Experience -- IBM Analytics
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central

Register here

Friday, 21 October 2016

Analysis without boundaries by Jacques Nadeau via @OReillyMedia

Apache Arrow makes it possible to use multiple languages and heterogeneous data infrastructure.

Wow - now that's something I can't wait to play with.

Monday, 5 September 2016

Airline Flight Data Analysis – Part 1 – Data Preparation by Michael Kamprath via @DIYBigData

This is the start of a PySpark data analysis project concerning airline on-time performance. In this post, the usefulness of the Apache Parquet data format is explained as data is loaded and cleaned.

This is a great post and has code and a link to his github containing the code.

Tuesday, 9 August 2016

Apache Spark: The Future of Big Data Science? by Matthew Thomson via @infomgmt

Spark is different from the myriad other solutions because it allows data scientists to develop simple code to perform distributed computing.

Yes, this is definitely more flexible, and so more efficient.

Friday, 29 July 2016

VIDEO: Distributed deep learning on Spark by Alexander Ulanov via @OReillyMedia

Alexander Ulanov offers an overview of tools and frameworks that have been proposed for performing deep learning on Spark.

I found this fascinating.

Saturday, 23 July 2016

Apache Spark: The Future of Big Data Science? by Matthew Thomson via @infomgmt

Spark is different from the myriad other solutions because it allows data scientists to develop simple code to perform distributed computing.

Interesting blog. Spark definitely seems worth learning.