print("Hello World!")
1 Key concepts of python
If you have experience with python, and understands how objects and classes works, you might want to skip this entire chapter. But, if you are new to the language and do not have much experience with it, you might want to stick a little bit, and learn a few key concepts that will help you to understand how the pyspark
package is organized, and how to work with it.
1.1 Scripts
Python programs are written in plain text files that are saved with the .py
extension. After you save these files, they are usually called “scripts”. So a script is just a text file that contains all the commands that make your python program.
There are many IDEs or programs that help you to write, manage, run and organize this kind of files (like Microsoft Visual Studio Code1, PyCharm2, Anaconda3 and RStudio4). Many of these programs are free to use, and, are easy to install.
But, if you do not have any of them installed, you can just create a new plain text file from the built-in Notepad program of your OS (operational system), and, save it with the .py
extension.
1.2 How to run a python program
As you learn to write your Spark applications with pyspark
, at some point, you will want to actually execute this pyspark
program, to see its result. To do so, you need to execute it as a python program. There are many ways to run a python program, but I will show you the more “standard” way. That is to use the python
command inside the terminal of your OS (you need to have python already installed).
As an example, lets create a simple “Hello world” program. First, open a new text file then save it somewhere in your machine (with the name hello.py
). Remember to save the file with the .py
extension. Then copy and paste the following command into this file:
It will be much easier to run this script, if you open your OS’s terminal inside the folder where you save the hello.py
file. After you opened the terminal inside the folder, just run the python3 hello.py
command. As a result, python will execute hello.py
, and, the text Hello World!
should be printed to the terminal:
Terminal$ python3 hello.py
Hello World!
But, if for some reason you could not open the terminal inside the folder, just open a terminal (in any way you can), then, use the cd
command (stands for “change directory”) with the path to the folder where you saved hello.py
. This way, your terminal will be rooted in this folder.
For example, if I saved hello.py
inside my Documents folder, the path to this folder in Windows would be something like this: "C:\Users\pedro\Documents"
. On the other hand, this path on Linux would be something like "/usr/pedro/Documents"
. So the command to change to this directory would be:
# On Windows:
Terminal$ cd "C:\Users\pedro\Documents"
# On Linux:
Terminal$ cd "/usr/pedro/Documents"
After this cd
command, you can run the python hello.py
command in the terminal, and get the exact same result of the previous example.
There you have it! So every time you need to run your python program (or your pyspark
program), just open a terminal and run the command python <complete path to your script>
. If the terminal is rooted on the folder where you saved your script, you can just use the python <name of the script>
command.
1.3 Objects
Although python is a general-purpose language, most of its features are focused on object-oriented programming. Meaning that, python is a programming language focused on creating, managing and modifying objects and classes of objects.
So, when you work with python, you are basically applying many operations and functions over a set of objects. In essence, an object in python, is a name that refers to a set of data. This data can be anything that you computer can store (or represent).
Having that in mind, an object is just a name, and this name is a reference, or a key to access some data. To define an object in python, you must use the assignment operator, which is the equal sign (=
). In the example below, we are defining, or, creating an object called x
, and it stores the value 10. Therefore, with the name x
we can access this value of 10.
= 10
x print(x)
10
When we store a value inside an object, we can easily reuse this value in multiple operations or expressions:
# Multiply by 2
print(x * 2)
20
# Divide by 3
print(x / 3)
3.3333333333333335
# Print its class
print(type(x))
<class 'int'>
Remember, an object can store any type of value, or any type of data. For example, it can store a single string, like the object salutation
below:
= "Hello! My name is Pedro" salutation
Or, a list of multiple strings:
= [
names "Anne", "Vanse", "Elliot",
"Carlyle", "Ed", "Memphis"
]
print(names)
['Anne', 'Vanse', 'Elliot', 'Carlyle', 'Ed', 'Memphis']
Or a dict containing the description of a product:
= {
product 'name': 'Coca Cola',
'volume': '2 litters',
'price': 2.52,
'group': 'non-alcoholic drinks',
'department': 'drinks'
}
print(product)
{'name': 'Coca Cola', 'volume': '2 litters', 'price': 2.52, 'group': 'non-alcoholic drinks', 'department': 'drinks'}
And many other things…
1.4 Expressions
Python programs are organized in blocks of expressions (or statements). A python expression is a statement that describes an operation to be performed by the program. For example, the expression below describes the sum between 3 and 5.
3 + 5
8
The expression above is composed of numbers (like 3 and 5) and a operator, more specifically, the sum operator (+
). But any python expression can include a multitude of different items. It can be composed of functions (like print()
, map()
and str()
), constant strings (like "Hello World!"
), logical operators (like !=
, <
, >
and ==
), arithmetic operators (like *
, /
, **
, %
, -
and +
), structures (like lists, arrays and dicts) and many other types of commands.
Below we have a more complex example, that contains the def
keyword (which starts a function definition; in the example below, this new function being defined is double()
), many built-in functions (list()
, map()
and print()
), a arithmetic operator (*
), numbers and a list (initiated by the pair of brackets - []
).
def double(x):
return x * 2
print(list(map(double, [4, 2, 6, 1])))
[8, 4, 12, 2]
Python expressions are evaluated in a sequential manner (from top to bottom of your python file). In other words, python runs the first expression in the top of your file, them, goes to the second expression, and runs it, them goes to the third expression, and runs it, and goes on and on in that way, until it hits the end of the file. So, in the example above, python executes the function definition (initiated at def double(x):
), before it executes the print()
statement, because the print statement is below the function definition.
This order of evaluation is commonly referred as “control flow” in many programming languages. Sometimes, this order can be a fundamental part of the python program. Meaning that, sometimes, if we change the order of the expressions in the program, we can produce unexpected results (like an error), or change the results produced by the program.
As an example, the program below prints the result 4, because the print statement is executed before the expression x = 40
.
= 1
x
print(x * 4)
= 40 x
4
But, if we execute the expression x = 40
before the print statement, we then change the result produced by the program.
= 1
x = 40
x
print(x * 4)
160
If we go a little further, and, put the print statement as the first expression of the program, we then get a name error. This error warns us that, the object named x
is not defined (i.e. it does not exist).
print(x * 4)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'x' is not defined
= 1
x = 40 x
This error occurs, because inside the print statement, we call the name x
. But, this is the first expression of the program, and at this point of the program, we did not defined a object called x
. We make this definition, after the print statement, with x = 1
and x = 40
. In other words, at this point, python do not know any object called x
.
1.5 Packages (or libraries)
A python package (or a python “library”) is basically a set of functions and classes that provides important functionality to solve a specific problem. And pyspark
is one of these many python packages available.
Python packages are usually published (that is, made available to the public) through the PyPI archive5. If a python package is published in PyPI, then, you can easily install it through the pip
tool.
To use a python package, you always need to: 1) have this package installed on your machine; 2) import this package in your python script. If a package is not installed in your machine, you will face a ModuleNotFoundError
as you try to use it, like in the example below.
import pandas
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
If your program produce this error, is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the pip install <name of the package>
command on the terminal of your OS.
Terminal$ pip install pandas
But, if this package is already installed in your machine, then, you can just import it to your script. To do this, you just include an import
statement at the start of your python file. For example, if I want to use the DataFrame
function from the pandas
package:
# Now that I installed the `pandas` package with `pip`
# this `import` statement works without any errors:
import pandas
= pandas.DataFrame([
df 1, 3214), (2, 4510),
(1, 9082), (4, 7822)
(
])
print(df)
0 1
0 1 3214
1 2 4510
2 1 9082
3 4 7822
Therefore, with import pandas
I can access any of the functions available in the pandas
package, by using the dot operator after the name of the package (pandas.<name of the function>
). However, it can become very annoying to write pandas.
every time you want to access a function from pandas
, specially if you use it constantly in your code.
To make life a little easier, python offers some alternative ways to define this import
statement. First, you can give an alias to this package that is shorter/easier to write. As an example, nowadays, is virtually a industry standard to import the pandas
package as pd
. To do this, you use the as
keyword in your import
statement. This way, you can access the pandas
functionality with pd.<name of the function>
:
import pandas as pd
= pd.DataFrame([
df 1, 3214), (2, 4510),
(1, 9082), (4, 7822)
(
])
print(df)
0 1
0 1 3214
1 2 4510
2 1 9082
3 4 7822
In contrast, if you want to make your life even easier and produce a more “clean” code, you can import (from the package) just the functions that you need to use. In this method, you can eliminate the dot operator, and refer directly to the function by its name. To use this method, you include the from
keyword in your import statement, like this:
from pandas import DataFrame
= DataFrame([
df 1, 3214), (2, 4510),
(1, 9082), (4, 7822)
(
])
print(df)
0 1
0 1 3214
1 2 4510
2 1 9082
3 4 7822
Just to be clear, you can import multiple functions from the package, by listing them. Or, if you prefer, you can import all components of the package (or module/sub-module) by using the star shortcut (*
):
# Import `search()`, `match()` and `compile()` functions:
from re import search, match, compile
# Import all functions from the `os` package
from os import *
Some packages may be very big, and includes many different functions and classes. As the size of the package becomes bigger and bigger, developers tend to divide this package in many “modules”. In other words, the functions and classes of this python package are usually organized in “modules”.
As an example, the pyspark
package is a fairly large package, that contains many classes and functions. Because of it, the package is organized in a number of modules, such as sql
(to access Spark SQL), pandas
(to access the Pandas API of Spark), ml
(to access Spark MLib).
To access the functions available in each one of these modules, you use the dot operator between the name of the package and the name of the module. For example, to import all components from the sql
and pandas
modules of pyspark
, you would do this:
from pyspark.sql import *
from pyspark.pandas import *
Going further, we can have sub-modules (or modules inside a module) too. As an example, the sql
module of pyspark
have the functions
and window
sub-modules. To access these sub-modules, you use the dot operator again:
# Importing `functions` and `window` sub-modules:
import pyspark.sql.functions as F
import pyspark.sql.window as W
1.6 Methods versus Functions
Beginners tend mix these two types of functions in python, but they are not the same. So lets describe the differences between the two.
Standard python functions, are functions that we apply over an object. A classical example, is the print()
function. You can see in the example below, that we are applying print()
over the result
object.
= 10 + 54
result print(result)
64
Other examples of a standard python function would be map()
and list()
. See in the example below, that we apply the map()
function over a set of objects:
= ['apple', 'star', 'abc']
words = map(len, words)
lengths list(lengths)
[5, 4, 3]
In contrast, a python method is a function registered inside a python class. In other words, this function belongs to the class itself, and cannot be used outside of it. This means that, in order to use a method, you need to have an instance of the class where it is registered.
For example, the startswith()
method belongs to the str
class (this class is used to represent strings in python). So to use this method, we need to have an instance of this class saved in a object that we can access. Note in the example below, that we access the startswith()
method through the name
object. This means that, startswith()
is a function. But, we cannot use it without an object of class str
, like name
.
= "Pedro"
name "P") name.startswith(
True
Note in the example above, that we access any class method in the same way that we would access a sub-module/module of a package. That is, by using the dot operator (.
).
So, if we have a class called people
, and, this class has a method called location()
, we can use this location()
method by using the dot operator (.
) with the name of an object of class people
. If an object called x
is an instance of people
class, then, we can do x.location()
.
But if this object x
is of a different class, like int
, then we can no longer use the location()
method, because this method does not belong to the int
class. For example, if your object is from class A
, and, you try to use a method of class B
, you will get an AttributeError
.
In the example exposed below, I have an object called number
of class int
, and, I try to use the method startswith()
from str
class with this object:
= 2
number # You can see below that, the `x` object have class `int`
type(number)
# Trying to use a method from `str` class
"P") number.startswith(
AttributeError: 'int' object has no attribute 'startswith'
1.7 Identifying classes and their methods
Over the next chapters, you will realize that pyspark
programs tend to use more methods than standard functions. So most of the functionality of pyspark
resides in class methods. As a result, the capability of understanding the objects that you have in your python program, and, identifying its classes and methods will be crucial while you are developing and debugging your Spark applications.
Every existing object in python represents an instance of a class. In other words, every object in python is associated to a given class. You can always identify the class of an object, by applying the type()
function over this object. In the example below, we can see that, the name
object is an instance of the str
class.
= "Pedro"
name type(name)
str
If you do not know all the methods that a class have, you can always apply the dir()
function over this class to get a list of all available methods. For example, lets suppose you wanted to see all methods from the str
class. To do so, you would do this:
dir(str)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
'__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__',
'__init_subclass__', '__iter__', '__le__', '__len__', '__lt__',
'__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__',
'__str__', '__subclasshook__', 'capitalize', 'casefold', 'center',
'count', 'encode', 'endswith', 'expandtabs', 'find', 'format',
'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal',
'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable',
'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip',
'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust',
'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith',
'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']