Big data is part of our lives now, and most companies that collect data have to deal with it at scale in order to gain meaningful insights. While we know complex neural networks work beautifully and accurately on a big data set, they are not always the ideal choice. In situations where the prediction task is complex yet predictions still need to be fast and efficient, we need a scalable machine learning solution.

Apache Spark comes with SparkML. SparkML has great built-in machine learning algorithms which are optimised for parallel processing and hence…

Machine learning is continuously evolving as information asymmetry lessens and complex models and algorithms become easier to implement and use. Python libraries like scikit-learn require only a few lines of code (excluding preprocessing) to fit a model and make predictions using high-level ensemble learning techniques like random forests. The question then arises: what gives you the edge?
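To make the "few lines of code" point concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the split ratio and the hyperparameters are assumptions chosen purely for illustration. They stand in for whatever preprocessed data you would have in practice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an already preprocessed dataset (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fitting the ensemble and predicting really is just a few lines.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Mean accuracy on the held-out split.
accuracy = model.score(X_test, y_test)
```

Everything here runs on a single machine and holds the whole dataset in memory, which is exactly the limitation that comes up once the data gets big.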

There are numerous guides and resources available online that show how to write these few lines of code and predict accurately. The challenge then comes down to how to use ML efficiently and dynamically. In a real use case, we know that we will not be using just…

Apache Spark

Apache Spark has become the prime tool for handling and managing big data. With the added advantages of being completely open source and having a very active community, it has largely replaced Hadoop’s MapReduce. One of the reasons for that is simplicity, both in management and usage. It provides support for Scala, Java, Python and R. Since Python is the most popular language for data science, I will be focusing on PySpark. However, not many changes are required to use any of the other languages.

Spark Architecture on k8s
How it works

An application with all its dependencies is submitted to a Kubernetes cluster…

