Introduction: PySpark is one of the most widely used big data tools, and one of the fastest. In this article, we will cover the introductory part of PySpark and share learnings inspired by DataCamp's course.

The first step: The first step in using Spark is connecting to a cluster. In practice, the cluster is hosted on a remote machine that is connected to all the other nodes. There is one computer, called the master, that manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and computations to run, and they send their results back to the master.

Creating a connection to Spark: Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to. An object holding all these attributes can be created with the SparkConf() constructor, as shown in the sketch below.
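Here is a minimal sketch of what that looks like, assuming Spark is installed locally; the master URL ("local[*]") and the app name ("intro-to-pyspark") are placeholder values you would replace with your own cluster's details.

```python
from pyspark import SparkConf, SparkContext

# Build a configuration object holding the cluster attributes.
# "local[*]" runs Spark locally on all available cores; on a real cluster
# this would be the master's URL instead.
conf = SparkConf().setMaster("local[*]").setAppName("intro-to-pyspark")

# Create the connection to the cluster.
sc = SparkContext(conf=conf)

# Inspect a few attributes of the connection.
print(sc.version)   # version of Spark the context is running
print(sc.master)    # URL of the master
print(sc.appName)   # name of this application

# Shut down the connection when finished.
sc.stop()
```

Note that in many notebook or course environments (such as the DataCamp exercises), a SparkContext is often already created and exposed as sc, so you only need to inspect it rather than construct it yourself.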
I write about machine learning models, Python programming, web scraping, statistical tests, and other coding or data science related things I find interesting. Read, learn, and grow with me!