The Column class

As we described at the introduction of Chapter 3, you will use the methods from the DataFrame class heavily in your Spark applications to manage, modify, and perform calculations over your Spark DataFrames. However, there is one more Python class that provides some very useful methods that you will use regularly, which is the Column class, or, more specifically, the pyspark.sql.column.Column class.
The Column class is used to represent a column in a Spark DataFrame. This means that each column of your Spark DataFrame is an object of class Column.

We can confirm this statement by taking the df DataFrame that we showed in Section 3.4 and looking at the class of one of its columns, like the id column:
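type(df.id)
pyspark.sql.column.Column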
You can refer to or create a column by using the col() and column() functions from the pyspark.sql.functions module. These functions receive as input a string with the name of the column you want to create or refer to, and their result is always an object of class Column. For example, the code below creates a column called ID:
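A minimal sketch of such a snippet (the variable name id_column is just illustrative):

from pyspark.sql.functions import col

# col() does not need an existing DataFrame; it only builds a
# Column object that refers to a column named "ID"
id_column = col('ID')
print(id_column)
# Prints something like: Column<'ID'>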
Many of the transformations that we want to apply over a Spark DataFrame are usually described through expressions, and these expressions in Spark are mainly composed of column transformations. That is why the Column class, and its methods, are so important in Apache Spark.

Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the id column of the df DataFrame, Spark returns an expression as a result, and not the values held by this column.
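For instance, assuming df is the DataFrame from Section 3.4, accessing the column only gives back a column expression (the exact printed form varies with the pyspark version):

print(df.id)
# Prints something like: Column<'id'>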
With these ideas in mind, when I created the column ID in the previous section, I created a “column expression”. This means that col("ID") is just an expression, and, as a consequence, Spark does not know what the values of column ID are, or where it lives (to which DataFrame does this column belong?). For now, Spark is not interested in this information; it just knows that we have an expression referring to a column called ID.

These ideas relate a lot to the lazy aspect of Spark that we talked about in Section 3.5. Spark will not perform any heavy calculation, or show you the actual results/values from your code, until you trigger the calculations with an action (we will talk more about these “actions” in Section 5.2). As a result, when you access a column, Spark will only deliver an expression that represents that column, and not its actual values.

This is handy, because we can store our expressions in variables and reuse them later, in multiple parts of our code. For example, I can keep combining a column with different kinds of operators to build a more complex expression. In the example below, I create an expression that doubles the values of the ID column:
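A sketch of that expression (expr_double_id is an illustrative name, not from the original example):

from pyspark.sql.functions import col

# An expression that doubles a column named "ID"; nothing is computed yet
expr_double_id = col('ID') * 2
print(expr_double_id)
# Prints something like: Column<'(ID * 2)'>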
Remember: with this expression, Spark knows that we want to get a column called ID from somewhere and double its values. But Spark will not perform that computation right now.

Logical expressions follow the same logic. In the example below, I am looking for rows where the value in column Name is equal to 'Anne' and the value in column Grade is above 6.
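A sketch of such a logical expression (the variable name is illustrative):

from pyspark.sql.functions import col

# A boolean column expression; note the & operator and the parentheses
# around each comparison, which Python's operator precedence requires
is_anne_above_6 = (col('Name') == 'Anne') & (col('Grade') > 6)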
Again, Spark just checks whether this is a valid logical expression. For now, Spark does not want to know where these Name and Grade columns are. Spark does not evaluate the expression until we ask for it with an action:
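Printing the expression from the sketch above only shows its textual representation; the rows that match it are only computed when the expression is used inside a DataFrame method (such as filter()) followed by an action like show():

print(is_anne_above_6)
# Prints something like: Column<'((Name = Anne) AND (Grade > 6))'>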
We now know that the columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations where you write a value (it can be a string, an integer, a boolean, or anything else) inside your pyspark code, and you actually want Spark to interpret this value as a constant (or literal) value, rather than an expression.

As an example, let's suppose you manage the data generated by the sales of five different stores, scattered across different regions of the city of Belo Horizonte (in Brazil). Now, let's suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is shown below:
+-----+----------+------------+-------+-------------------+-----+
|price|product_id|product_name|sale_id| timestamp|units|
+-----+----------+------------+-------+-------------------+-----+
| 3.12| 134| Milk 1L Mua| 328711|2022-02-01T22:10:02| 1|
| 1.22| 110| Coke 350ml| 328712|2022-02-03T11:42:09| 3|
| 4.65| 117| Pepsi 2L| 328713|2022-02-03T14:22:15| 1|
| 1.22| 110| Coke 350ml| 328714|2022-02-03T18:33:08| 1|
| 0.85| 341|Trident Mint| 328715|2022-02-04T15:41:36| 1|
+-----+----------+------------+-------+-------------------+-----+
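If you want to follow along, a minimal sketch that recreates this batch as a sales DataFrame (assuming an active SparkSession named spark, and keeping the timestamps as plain strings) could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the batch of sales shown above
sales = spark.createDataFrame(
    [
        (3.12, 134, 'Milk 1L Mua', 328711, '2022-02-01T22:10:02', 1),
        (1.22, 110, 'Coke 350ml', 328712, '2022-02-03T11:42:09', 3),
        (4.65, 117, 'Pepsi 2L', 328713, '2022-02-03T14:22:15', 1),
        (1.22, 110, 'Coke 350ml', 328714, '2022-02-03T18:33:08', 1),
        (0.85, 341, 'Trident Mint', 328715, '2022-02-04T15:41:36', 1),
    ],
    ['price', 'product_id', 'product_name', 'sale_id', 'timestamp', 'units']
)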
If you look at this batch… there is no indication that these sales come from the 4th store. In other words, this information is not present in the data; it is just in your mind. It certainly is a very bad idea to leave this data as is, without any identification of its source. So you might want to add some labels and new columns to this batch of data that can easily identify the store that originated these sales.

For example, we could add two new columns to this sales DataFrame: one for the number that identifies the store (4), and another to keep the store address. Considering that all rows in this batch come from the 4th store, we should add two “constant” columns, meaning that these columns should have a constant value across all rows in this batch. But how can we do this? How can we create a “constant” column? The answer is: by forcing Spark to interpret the values as literal values, instead of expressions.
In other words, I cannot use the col() function to create these two new columns, because the col() function receives a column name as input: it interprets our input as an expression that refers to an existing column. This function does not accept some sort of description of the actual values that the column should store.

So how do we force Spark to interpret a value as a literal (or constant) value, rather than an expression? To do this, you must write this value inside the lit() (short for “literal”) function from the pyspark.sql.functions module.

When you write the statement lit(4) in your code, Spark understands that you want to create a new column which is filled with 4’s. In other words, this new column is filled with the constant integer 4.
With the code below, I am creating two new columns (called store_number and store_address) and adding them to the sales DataFrame:
from pyspark.sql.functions import lit

store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')

sales = sales\
    .select(
        '*', store_number, store_address
    )

sales\
    .select(
        'product_id', 'product_name',
        'store_number', 'store_address'
    )\
    .show(5)
+----------+------------+------------+--------------------+
|product_id|product_name|store_number| store_address|
+----------+------------+------------+--------------------+
| 134| Milk 1L Mua| 4|Amazonas Avenue, 324|
| 110| Coke 350ml| 4|Amazonas Avenue, 324|
| 117| Pepsi 2L| 4|Amazonas Avenue, 324|
| 110| Coke 350ml| 4|Amazonas Avenue, 324|
| 341|Trident Mint| 4|Amazonas Avenue, 324|
+----------+------------+------------+--------------------+
In essence, you normally use the lit() function when you want to write a literal value in a place where Spark expects a column name. In the example above, instead of writing the name of an existing column in the sales DataFrame, I wanted to write the literal values 'Amazonas Avenue, 324' and 4, and I used the lit() function to make this intention very clear to Spark. If I had not used the lit() function, the select() method would interpret the value 'Amazonas Avenue, 324' as a reference to an existing column named Amazonas Avenue, 324.
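To see what goes wrong without lit(), a quick check like the sketch below should fail, because the sales DataFrame has no column with that name (the exact exception class varies across pyspark versions):

# Without lit(), the string is treated as a reference to a column,
# and Spark cannot resolve a column with this name
try:
    sales.select('Amazonas Avenue, 324').show()
except Exception as error:
    print(type(error).__name__)  # typically an AnalysisException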
Key methods of the Column class

Because many of the transformations that we want to apply over our DataFrames are expressed as column transformations, the methods from the Column class will be quite useful in many different contexts. You will see many of these methods across the next chapters, like desc(), alias() and cast().

Remember, you can always use the dir() function to see the complete list of methods available in any Python class. It is always useful to check the official documentation too; there you will find a more complete description of each method.

But since they are so important in Spark, let's give you a brief overview of some of the most popular methods from the Column class (these methods are described in more detail in later chapters, and a short sketch using a few of them follows the list):
- desc() and asc(): methods to order the values of the column in descending or ascending order (respectively);
- cast() and astype(): methods to cast (or convert) the values of the column to a specific data type;
- alias(): method to rename a column;
- substr(): method that returns a new column with a substring of each value;
- isNull() and isNotNull(): logical methods to test whether each value in the column is a null value or not;
- startswith() and endswith(): logical methods to search for values that start with or end with a specific pattern;
- like() and rlike(): logical methods to search for a specific pattern or regular expression in the values of the column;
- isin(): logical method to test whether each value in the column is one of the listed values.
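A minimal sketch of a few of these methods in action on the sales DataFrame (the selected columns and aliases are just illustrative, not the book's own example):

from pyspark.sql.functions import col

sales\
    .select(
        col('product_name').alias('product'),
        col('price').cast('double'),
        col('product_name').startswith('Coke').alias('is_coke'),
        col('product_id').isin(110, 117).alias('is_soda')
    )\
    .orderBy(col('price').desc())\
    .show(5)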