Data
With the Apache Spark 1.3 release, the DataFrame API for Spark SQL was introduced. For those of you who missed the big announcement, I'd recommend reading the article Introducing DataFrames in Spark for Large Scale Data Science from the Databricks blog. DataFrames are very popular among data scientists; personally, I've mainly been using them with the great Python library Pandas, but there are equivalents in R (where they originated) and Julia.
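To give a feel for the API, here is a minimal sketch that can be pasted into a Spark 1.3 spark-shell (which already provides sqlContext); the people.json file and its age and city fields are made up for the example.

```scala
// Spark 1.3 spark-shell already provides `sc` and `sqlContext`.
// `people.json` is a hypothetical line-delimited JSON dataset.
val people = sqlContext.jsonFile("people.json")

// The schema is inferred from the JSON documents.
people.printSchema()

// A Pandas-like, declarative way to filter and aggregate.
people.filter(people("age") > 21)
      .groupBy("city")
      .count()
      .show()
```

Written against the plain RDD API, the same pipeline would be a handful of map and reduceByKey calls; the DataFrame version also lets Spark SQL's optimizer see the whole query.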
Apache Spark's default serialization relies on Java serialization, using the default readObject(...) and writeObject(...) methods of all Serializable classes. This is a perfectly fine default behavior, as long as you don't rely on it too much...
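The usual way out is to switch to Kryo. Here is a minimal sketch, assuming Spark 1.2+ (where SparkConf.registerKryoClasses is available) and a hypothetical MyRecord class standing in for your own types:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type standing in for whatever actually crosses the wire.
case class MyRecord(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Swap the default Java serializer for Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: fail fast instead of silently writing full class names
  // for classes you forgot to register.
  .set("spark.kryo.registrationRequired", "true")
  // Registered classes are serialized with a small numeric id
  // instead of their fully qualified name.
  .registerKryoClasses(Array(classOf[MyRecord]))

val sc = new SparkContext(conf)
```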
I'm using Lucene more and more these days, and getting in depth on a few subjects; today I'm going to talk to you about how to handle the new highlighting features available in Lucene 4.1.
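The centerpiece of those features is the new PostingsHighlighter, which reads term offsets straight from the postings lists instead of re-analyzing the stored text. A minimal, self-contained sketch in Scala (the body field name and the sample text are mine); the one thing to get right is indexing the field with offsets:

```scala
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, FieldType, TextField}
import org.apache.lucene.index.{DirectoryReader, FieldInfo, IndexWriter, IndexWriterConfig, Term}
import org.apache.lucene.search.{IndexSearcher, TermQuery}
import org.apache.lucene.search.postingshighlight.PostingsHighlighter
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// The PostingsHighlighter reads offsets from the index, so the field
// must be indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS.
val fieldType = new FieldType(TextField.TYPE_STORED)
fieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)

val dir = new RAMDirectory()
val writer = new IndexWriter(dir,
  new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41)))
val doc = new Document()
doc.add(new Field("body", "Lucene 4.1 ships a brand new postings-based highlighter.", fieldType))
writer.addDocument(doc)
writer.close()

val searcher = new IndexSearcher(DirectoryReader.open(dir))
val query = new TermQuery(new Term("body", "highlighter"))
val topDocs = searcher.search(query, 10)

// One snippet per hit, with the matched terms wrapped in <b>...</b> by default.
val highlighter = new PostingsHighlighter()
val snippets = highlighter.highlight("body", query, searcher, topDocs)
snippets.foreach(println)
```

Indexing offsets costs some index size, but highlighting no longer needs stored term vectors or re-analysis at query time.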
I've begun to work more and more with the great “low-level” library Apache Lucene, created by Doug Cutting. For those of you who may not know it, Lucene is the indexing and searching library used by great enterprise search servers like Apache Solr and Elasticsearch.
Don't get me wrong, I love Apache Solr, I think it's a wonderful project, and the 4.x versions are definitely something you should check out when building a proper search engine.
As time is always running out, I don't think I'll have the time anytime soon to work again on the data I collected for the last three articles: Going offline with Maven, State of the Maven/Java dependency graph and State of the PyPi/Python dependency graph.
At Lateral-Thoughts, we organize, at least once a year, what we call a “Timeoff”, where we get together in a nice place and hack on whatever we want. It can be a learning period or a startup-weekend-like event where we hack on a product or an idea. Last time it was in a nice house in Guérande where we had everything we needed: internet access, rooms, tables, lots of space, an indoor swimming pool and a barbecue!
So here it comes: the second part of a three-part series on dependencies in different worlds. The first part was about Python/PyPi dependencies, and given the size of that graph (20,661 nodes, 14,047 edges) I was able to show it to you in an interactive JavaScript app using SigmaJS. But this time it's different: after extracting the metadata from the Maven repositories, the raw data file weighs 273 MB, and the whole directed dependency graph has 186,384 nodes and 1,229,083 edges. In other words, it's going to be tough to show you the whole graph interactively, but the raw data, the graph file and the Gephi file are available on the GitHub project.
I usually work in a Java/Maven environment, so when I explain to people that Python also has a package manager (a bit less heavy than Maven) and that it works pretty well, I always get the same question: “OK, but how does it solve the transitive dependency hell?”