1 Key concepts of python

If you have experience with python, and understands how objects and classes works, you might want to skip this entire chapter. But, if you are new to the language and do not have much experience with it, you might want to stick a little bit, and learn a few key concepts that will help you to understand how the pyspark package is organized, and how to work with it.

1.1 Scripts

Python programs are written in plain text files that are saved with the .py extension. After you save these files, they are usually called “scripts”. So a script is just a text file that contains all the commands that make your python program.

There are many IDEs or programs that help you to write, manage, run and organize this kind of files (like Microsoft Visual Studio Code¹, PyCharm², Anaconda³ and RStudio⁴). Many of these programs are free to use, and, are easy to install.

But, if you do not have any of them installed, you can just create a new plain text file from the built-in Notepad program of your OS (operational system), and, save it with the .py extension.

1.2 How to run a python program

As you learn to write your Spark applications with pyspark, at some point, you will want to actually execute this pyspark program, to see its result. To do so, you need to execute it as a python program. There are many ways to run a python program, but I will show you the more “standard” way. That is to use the python command inside the terminal of your OS (you need to have python already installed).

As an example, lets create a simple “Hello world” program. First, open a new text file then save it somewhere in your machine (with the name hello.py). Remember to save the file with the .py extension. Then copy and paste the following command into this file:

print("Hello World!")

It will be much easier to run this script, if you open your OS’s terminal inside the folder where you save the hello.py file. After you opened the terminal inside the folder, just run the python3 hello.py command. As a result, python will execute hello.py, and, the text Hello World! should be printed to the terminal:

Terminal$ python3 hello.py

Hello World!

But, if for some reason you could not open the terminal inside the folder, just open a terminal (in any way you can), then, use the cd command (stands for “change directory”) with the path to the folder where you saved hello.py. This way, your terminal will be rooted in this folder.

For example, if I saved hello.py inside my Documents folder, the path to this folder in Windows would be something like this: "C:\Users\pedro\Documents". On the other hand, this path on Linux would be something like "/usr/pedro/Documents". So the command to change to this directory would be:

# On Windows:
Terminal$ cd "C:\Users\pedro\Documents"
# On Linux:
Terminal$ cd "/usr/pedro/Documents"

After this cd command, you can run the python hello.py command in the terminal, and get the exact same result of the previous example.

There you have it! So every time you need to run your python program (or your pyspark program), just open a terminal and run the command python <complete path to your script>. If the terminal is rooted on the folder where you saved your script, you can just use the python <name of the script> command.

1.3 Objects

Although python is a general-purpose language, most of its features are focused on object-oriented programming. Meaning that, python is a programming language focused on creating, managing and modifying objects and classes of objects.

So, when you work with python, you are basically applying many operations and functions over a set of objects. In essence, an object in python, is a name that refers to a set of data. This data can be anything that you computer can store (or represent).

Having that in mind, an object is just a name, and this name is a reference, or a key to access some data. To define an object in python, you must use the assignment operator, which is the equal sign (=). In the example below, we are defining, or, creating an object called x, and it stores the value 10. Therefore, with the name x we can access this value of 10.

x = 10
print(x)

When we store a value inside an object, we can easily reuse this value in multiple operations or expressions:

# Multiply by 2
print(x * 2)

# Divide by 3
print(x / 3)

3.3333333333333335

# Print its class
print(type(x))

<class 'int'>

Remember, an object can store any type of value, or any type of data. For example, it can store a single string, like the object salutation below:

salutation = "Hello! My name is Pedro"

Or, a list of multiple strings:

names = [
  "Anne", "Vanse", "Elliot",
  "Carlyle", "Ed", "Memphis"
]

print(names)

['Anne', 'Vanse', 'Elliot', 'Carlyle', 'Ed', 'Memphis']

Or a dict containing the description of a product:

product = {
  'name': 'Coca Cola',
  'volume': '2 litters',
  'price': 2.52,
  'group': 'non-alcoholic drinks',
  'department': 'drinks'
}

print(product)

{'name': 'Coca Cola', 'volume': '2 litters', 'price': 2.52, 'group': 'non-alcoholic drinks', 'department': 'drinks'}

And many other things…

1.4 Expressions

Python programs are organized in blocks of expressions (or statements). A python expression is a statement that describes an operation to be performed by the program. For example, the expression below describes the sum between 3 and 5.

3 + 5

The expression above is composed of numbers (like 3 and 5) and a operator, more specifically, the sum operator (+). But any python expression can include a multitude of different items. It can be composed of functions (like print(), map() and str()), constant strings (like "Hello World!"), logical operators (like !=, <, > and ==), arithmetic operators (like *, /, **, %, - and +), structures (like lists, arrays and dicts) and many other types of commands.

Below we have a more complex example, that contains the def keyword (which starts a function definition; in the example below, this new function being defined is double()), many built-in functions (list(), map() and print()), a arithmetic operator (*), numbers and a list (initiated by the pair of brackets - []).

def double(x):
  return x * 2
  
print(list(map(double, [4, 2, 6, 1])))

[8, 4, 12, 2]

Python expressions are evaluated in a sequential manner (from top to bottom of your python file). In other words, python runs the first expression in the top of your file, them, goes to the second expression, and runs it, them goes to the third expression, and runs it, and goes on and on in that way, until it hits the end of the file. So, in the example above, python executes the function definition (initiated at def double(x):), before it executes the print() statement, because the print statement is below the function definition.

This order of evaluation is commonly referred as “control flow” in many programming languages. Sometimes, this order can be a fundamental part of the python program. Meaning that, sometimes, if we change the order of the expressions in the program, we can produce unexpected results (like an error), or change the results produced by the program.

As an example, the program below prints the result 4, because the print statement is executed before the expression x = 40.

x = 1

print(x * 4)

x = 40

But, if we execute the expression x = 40 before the print statement, we then change the result produced by the program.

x = 1
x = 40

print(x * 4)

If we go a little further, and, put the print statement as the first expression of the program, we then get a name error. This error warns us that, the object named x is not defined (i.e. it does not exist).

print(x * 4)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined

x = 1
x = 40

This error occurs, because inside the print statement, we call the name x. But, this is the first expression of the program, and at this point of the program, we did not defined a object called x. We make this definition, after the print statement, with x = 1 and x = 40. In other words, at this point, python do not know any object called x.

1.5 Packages (or libraries)

A python package (or a python “library”) is basically a set of functions and classes that provides important functionality to solve a specific problem. And pyspark is one of these many python packages available.

Python packages are usually published (that is, made available to the public) through the PyPI archive⁵. If a python package is published in PyPI, then, you can easily install it through the pip tool.

To use a python package, you always need to: 1) have this package installed on your machine; 2) import this package in your python script. If a package is not installed in your machine, you will face a ModuleNotFoundError as you try to use it, like in the example below.

import pandas

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'

If your program produce this error, is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the pip install <name of the package> command on the terminal of your OS.

Terminal$ pip install pandas

But, if this package is already installed in your machine, then, you can just import it to your script. To do this, you just include an import statement at the start of your python file. For example, if I want to use the DataFrame function from the pandas package:

# Now that I installed the `pandas` package with `pip`
# this `import` statement works without any errors:
import pandas

df = pandas.DataFrame([
  (1, 3214), (2, 4510), 
  (1, 9082), (4, 7822)
])

print(df)

Therefore, with import pandas I can access any of the functions available in the pandas package, by using the dot operator after the name of the package (pandas.<name of the function>). However, it can become very annoying to write pandas. every time you want to access a function from pandas, specially if you use it constantly in your code.

To make life a little easier, python offers some alternative ways to define this import statement. First, you can give an alias to this package that is shorter/easier to write. As an example, nowadays, is virtually a industry standard to import the pandas package as pd. To do this, you use the as keyword in your import statement. This way, you can access the pandas functionality with pd.<name of the function>:

import pandas as pd

df = pd.DataFrame([
  (1, 3214), (2, 4510), 
  (1, 9082), (4, 7822)
])

print(df)

In contrast, if you want to make your life even easier and produce a more “clean” code, you can import (from the package) just the functions that you need to use. In this method, you can eliminate the dot operator, and refer directly to the function by its name. To use this method, you include the from keyword in your import statement, like this:

from pandas import DataFrame

df = DataFrame([
  (1, 3214), (2, 4510), 
  (1, 9082), (4, 7822)
])

print(df)

Just to be clear, you can import multiple functions from the package, by listing them. Or, if you prefer, you can import all components of the package (or module/sub-module) by using the star shortcut (*):

# Import `search()`, `match()` and `compile()` functions:
from re import search, match, compile
# Import all functions from the `os` package
from os import *

Some packages may be very big, and includes many different functions and classes. As the size of the package becomes bigger and bigger, developers tend to divide this package in many “modules”. In other words, the functions and classes of this python package are usually organized in “modules”.

As an example, the pyspark package is a fairly large package, that contains many classes and functions. Because of it, the package is organized in a number of modules, such as sql (to access Spark SQL), pandas (to access the Pandas API of Spark), ml (to access Spark MLib).

To access the functions available in each one of these modules, you use the dot operator between the name of the package and the name of the module. For example, to import all components from the sql and pandas modules of pyspark, you would do this:

from pyspark.sql import *
from pyspark.pandas import *

Going further, we can have sub-modules (or modules inside a module) too. As an example, the sql module of pyspark have the functions and window sub-modules. To access these sub-modules, you use the dot operator again:

# Importing `functions` and `window` sub-modules:
import pyspark.sql.functions as F
import pyspark.sql.window as W

1.6 Methods versus Functions

Beginners tend mix these two types of functions in python, but they are not the same. So lets describe the differences between the two.

Standard python functions, are functions that we apply over an object. A classical example, is the print() function. You can see in the example below, that we are applying print() over the result object.

result = 10 + 54
print(result)

Other examples of a standard python function would be map() and list(). See in the example below, that we apply the map() function over a set of objects:

words = ['apple', 'star', 'abc']
lengths = map(len, words)
list(lengths)

[5, 4, 3]

In contrast, a python method is a function registered inside a python class. In other words, this function belongs to the class itself, and cannot be used outside of it. This means that, in order to use a method, you need to have an instance of the class where it is registered.

For example, the startswith() method belongs to the str class (this class is used to represent strings in python). So to use this method, we need to have an instance of this class saved in a object that we can access. Note in the example below, that we access the startswith() method through the name object. This means that, startswith() is a function. But, we cannot use it without an object of class str, like name.

name = "Pedro"
name.startswith("P")

True

Note in the example above, that we access any class method in the same way that we would access a sub-module/module of a package. That is, by using the dot operator (.).

So, if we have a class called people, and, this class has a method called location(), we can use this location() method by using the dot operator (.) with the name of an object of class people. If an object called x is an instance of people class, then, we can do x.location().

But if this object x is of a different class, like int, then we can no longer use the location() method, because this method does not belong to the int class. For example, if your object is from class A, and, you try to use a method of class B, you will get an AttributeError.

In the example exposed below, I have an object called number of class int, and, I try to use the method startswith() from str class with this object:

number = 2
# You can see below that, the `x` object have class `int`
type(number)
# Trying to use a method from `str` class
number.startswith("P")

AttributeError: 'int' object has no attribute 'startswith'

1.7 Identifying classes and their methods

Over the next chapters, you will realize that pyspark programs tend to use more methods than standard functions. So most of the functionality of pyspark resides in class methods. As a result, the capability of understanding the objects that you have in your python program, and, identifying its classes and methods will be crucial while you are developing and debugging your Spark applications.

Every existing object in python represents an instance of a class. In other words, every object in python is associated to a given class. You can always identify the class of an object, by applying the type() function over this object. In the example below, we can see that, the name object is an instance of the str class.

name = "Pedro"
type(name)

str

If you do not know all the methods that a class have, you can always apply the dir() function over this class to get a list of all available methods. For example, lets suppose you wanted to see all methods from the str class. To do so, you would do this:

dir(str)

['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
 '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
 '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__',
 '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', 
 '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', 
 '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center',
 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format',
 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal',
 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable',
 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 
 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 
 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']