Apache Spark
With the Apache Spark 1.3 release, the DataFrame API for Spark SQL was introduced. For those of you who missed the big announcement, I'd recommend reading the article Introducing DataFrames in Spark for Large Scale Data Science on the Databricks blog. DataFrames are very popular among data scientists; personally, I've mainly used them through the great Python library Pandas, but there are equivalents in R (where they originated) and Julia.
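To give a feel for the API, here is a minimal sketch in Scala of what DataFrame code can look like against Spark 1.3; the Person case class and the sample values are purely illustrative, not taken from the post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameExample {
  // Illustrative schema for the example
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("df-example").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a DataFrame from a local collection of case classes
    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35))).toDF()

    // Declarative, Pandas-like operations instead of raw RDD transformations
    people.filter(people("age") > 30).select("name").show()
  }
}
```

The filter/select calls read a lot like their Pandas counterparts, which is a large part of the appeal.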
Apache Spark's default serialization relies on Java serialization, with the default readObject(...) and writeObject(...) methods used for all Serializable classes. This is a perfectly fine default as long as you don't rely on it too much...
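As a rough illustration of the usual alternative, here is a sketch of swapping the default Java serialization for Kryo and registering a class with it; the Event class and the job itself are assumptions made for the example, not something prescribed by the post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical domain class; anything shipped to executors gets serialized.
case class Event(id: Long, payload: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("serialization-example")
      .setMaster("local[*]")
      // Replace the default Java serialization with Kryo...
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // ...and register the classes Kryo will encounter, to keep the payload compact.
      .registerKryoClasses(Array(classOf[Event]))

    val sc = new SparkContext(conf)
    val count = sc.parallelize(1L to 1000L)
      .map(i => Event(i, s"payload-$i"))
      .filter(_.id % 2 == 0)
      .count()
    println(count)
  }
}
```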
Ever wanted to try out Apache Spark without actually having to install anything? Well, if you've got Docker, I've got a Christmas present for you: a Docker image you can pull to run Spark commands in the Spark shell REPL. The image has been pushed to the Docker Hub here and can easily be pulled with Docker.
So what exactly is this image, and how can I use it?
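In short, usage boils down to the usual pull-and-run workflow; the image name below is only a placeholder, the real one is behind the Docker Hub link above.

```bash
# "someuser/spark-shell" is a placeholder -- use the actual image name from the Docker Hub link.
docker pull someuser/spark-shell
# Start the container interactively and drop into the Spark shell REPL.
docker run -it someuser/spark-shell
```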
Many of the concepts of Apache Spark are pretty straightforward and easy to understand; however, a few of them can be badly misunderstood. One of the greatest misunderstandings of all is the belief, which some people still hold, that “Spark is only relevant for datasets that fit into memory, otherwise it will crash”.
This is an understandable mistake, since Spark is easily pictured as a “Hadoop that uses RAM more efficiently”, but it is a mistake nonetheless.
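To make the point concrete, here is a small sketch of one of the mechanisms involved: persisting an RDD with a storage level that allows partitions to spill to disk when they don't fit in memory. The input path is a placeholder standing in for a dataset larger than the cluster's aggregate memory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("spill-example").setMaster("local[*]"))

    // Placeholder path for a dataset bigger than the available memory.
    val lines = sc.textFile("hdfs:///data/huge-dataset")

    // Partitions that don't fit in memory are spilled to local disk
    // instead of crashing the job.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    println(lines.count())
  }
}
```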