The Column class

As we described at the introduction of Chapter 3, you will use the methods from the DataFrame class heavily in your Spark applications to manage, modify, and perform calculations over your Spark DataFrames. However, there is one more Python class that provides some very useful methods that you will use regularly, which is the Column class, or, more specifically, the pyspark.sql.column.Column class.
The Column class is used to represent a column in a Spark DataFrame. This means that each column of your Spark DataFrame is an object of class Column.

We can confirm this statement by taking the df DataFrame that we showed in Section 3.4 and looking at the class of one of its columns, like the id column:
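type(df.id)
pyspark.sql.column.Column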
You can refer to or create a column by using the col() and column() functions from the pyspark.sql.functions module. These functions receive as input a string with the name of the column you want to create or refer to, and their result is always an object of class Column. For example, the code below creates a column called ID:
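A minimal sketch of such a snippet (the variable name id_column is just illustrative):

from pyspark.sql.functions import col

# col() does not need an existing DataFrame; it only builds a
# Column object that refers to a column named "ID"
id_column = col('ID')
print(id_column)
# Prints something like: Column<'ID'>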
Many of the transformations that we want to apply over a Spark DataFrame are usually described through expressions, and these expressions in Spark are mainly composed of column transformations. That is why the Column class, and its methods, are so important in Apache Spark.

Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the id column of the df DataFrame, Spark returns an expression as a result, and not the values held by this column.
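For instance, assuming df is the DataFrame from Section 3.4, accessing the column only gives back a column expression (the exact printed form varies with the pyspark version):

print(df.id)
# Prints something like: Column<'id'>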
With these ideas in mind, when I created the column ID in the previous section, I created a “column expression”. This means that col("ID") is just an expression, and, as a consequence, Spark does not know what the values of column ID are, or where it lives (to which DataFrame does this column belong?). For now, Spark is not interested in this information; it just knows that we have an expression referring to a column called ID.

These ideas relate a lot to the lazy aspect of Spark that we talked about in Section 3.5. Spark will not perform any heavy calculation, or show you the actual results/values from your code, until you trigger the calculations with an action (we will talk more about these “actions” in Section 5.2). As a result, when you access a column, Spark will only deliver an expression that represents that column, and not its actual values.

This is handy, because we can store our expressions in variables and reuse them later, in multiple parts of our code. For example, I can keep combining a column with different kinds of operators to build a more complex expression. In the example below, I create an expression that doubles the values of the ID column:
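A sketch of that expression (expr_double_id is an illustrative name, not from the original example):

from pyspark.sql.functions import col

# An expression that doubles a column named "ID"; nothing is computed yet
expr_double_id = col('ID') * 2
print(expr_double_id)
# Prints something like: Column<'(ID * 2)'>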
Remember: with this expression, Spark knows that we want to get a column called ID from somewhere and double its values. But Spark will not perform that computation right now.

Logical expressions follow the same logic. In the example below, I am looking for rows where the value in column Name is equal to 'Anne' and the value in column Grade is above 6.
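A sketch of such a logical expression (the variable name is illustrative):

from pyspark.sql.functions import col

# A boolean column expression; note the & operator and the parentheses
# around each comparison, which Python's operator precedence requires
is_anne_above_6 = (col('Name') == 'Anne') & (col('Grade') > 6)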
Again, Spark just checks whether this is a valid logical expression. For now, Spark does not want to know where these Name and Grade columns are. Spark does not evaluate the expression until we ask for it with an action:
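Printing the expression from the sketch above only shows its textual representation; the rows that match it are only computed when the expression is used inside a DataFrame method (such as filter()) followed by an action like show():

print(is_anne_above_6)
# Prints something like: Column<'((Name = Anne) AND (Grade > 6))'>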
We now know that the columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations where you write a value (it can be a string, an integer, a boolean, or anything else) inside your pyspark code, and you actually want Spark to interpret this value as a constant (or literal) value, rather than an expression.

As an example, let's suppose you manage the data generated by the sales of five different stores, scattered across different regions of the city of Belo Horizonte (in Brazil). Now, let's suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is shown below:
+-----+----------+------------+-------+-------------------+-----+
|price|product_id|product_name|sale_id| timestamp|units|
+-----+----------+------------+-------+-------------------+-----+
| 3.12| 134| Milk 1L Mua| 328711|2022-02-01T22:10:02| 1|
| 1.22| 110| Coke 350ml| 328712|2022-02-03T11:42:09| 3|
| 4.65| 117| Pepsi 2L| 328713|2022-02-03T14:22:15| 1|
| 1.22| 110| Coke 350ml| 328714|2022-02-03T18:33:08| 1|
| 0.85| 341|Trident Mint| 328715|2022-02-04T15:41:36| 1|
+-----+----------+------------+-------+-------------------+-----+
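If you want to follow along, a minimal sketch that recreates this batch as a sales DataFrame (assuming an active SparkSession named spark, and keeping the timestamps as plain strings) could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the batch of sales shown above
sales = spark.createDataFrame(
    [
        (3.12, 134, 'Milk 1L Mua', 328711, '2022-02-01T22:10:02', 1),
        (1.22, 110, 'Coke 350ml', 328712, '2022-02-03T11:42:09', 3),
        (4.65, 117, 'Pepsi 2L', 328713, '2022-02-03T14:22:15', 1),
        (1.22, 110, 'Coke 350ml', 328714, '2022-02-03T18:33:08', 1),
        (0.85, 341, 'Trident Mint', 328715, '2022-02-04T15:41:36', 1),
    ],
    ['price', 'product_id', 'product_name', 'sale_id', 'timestamp', 'units']
)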
If you look at this batch… there is no indication that these sales come from the 4th store. In other words, this information is not present in the data; it is just in your mind. It certainly is a very bad idea to leave this data as is, without any identification of its source. So you might want to add some labels and new columns to this batch of data that can easily identify the store that originated these sales.

For example, we could add two new columns to this sales DataFrame: one for the number that identifies the store (4), and another to keep the store address. Considering that all rows in this batch come from the 4th store, we should add two “constant” columns, meaning that these columns should have a constant value across all rows in this batch. But how can we do this? How can we create a “constant” column? The answer is: by forcing Spark to interpret the values as literal values, instead of expressions.
In other words, I cannot use the col() function to create these two new columns, because the col() function receives a column name as input: it interprets our input as an expression that refers to an existing column. This function does not accept some sort of description of the actual values that the column should store.

So how do we force Spark to interpret a value as a literal (or constant) value, rather than an expression? To do this, you must write this value inside the lit() (short for “literal”) function from the pyspark.sql.functions module.

When you write the statement lit(4) in your code, Spark understands that you want to create a new column which is filled with 4’s. In other words, this new column is filled with the constant integer 4.
With the code below, I am creating two new columns (called store_number and store_address) and adding them to the sales DataFrame:
from pyspark.sql.functions import lit

store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')

sales = sales\
    .select(
        '*', store_number, store_address
    )

sales\
    .select(
        'product_id', 'product_name',
        'store_number', 'store_address'
    )\
    .show(5)
+----------+------------+------------+--------------------+
|product_id|product_name|store_number| store_address|
+----------+------------+------------+--------------------+
| 134| Milk 1L Mua| 4|Amazonas Avenue, 324|
| 110| Coke 350ml| 4|Amazonas Avenue, 324|
| 117| Pepsi 2L| 4|Amazonas Avenue, 324|
| 110| Coke 350ml| 4|Amazonas Avenue, 324|
| 341|Trident Mint| 4|Amazonas Avenue, 324|
+----------+------------+------------+--------------------+
In essence, you normally use the lit() function when you want to write a literal value in a place where Spark expects a column name. In the example above, instead of writing the name of an existing column in the sales DataFrame, I wanted to write the literal values 'Amazonas Avenue, 324' and 4, and I used the lit() function to make this intention very clear to Spark. If I had not used the lit() function, the select() method would interpret the value 'Amazonas Avenue, 324' as a reference to an existing column named Amazonas Avenue, 324.
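To see what goes wrong without lit(), a quick check like the sketch below should fail, because the sales DataFrame has no column with that name (the exact exception class varies across pyspark versions):

# Without lit(), the string is treated as a reference to a column,
# and Spark cannot resolve a column with this name
try:
    sales.select('Amazonas Avenue, 324').show()
except Exception as error:
    print(type(error).__name__)  # typically an AnalysisException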
Key methods of the Column class

Because many of the transformations that we want to apply over our DataFrames are expressed as column transformations, the methods from the Column class will be quite useful in many different contexts. You will see many of these methods across the next chapters, like desc(), alias() and cast().

Remember, you can always use the dir() function to see the complete list of methods available in any Python class. It is always useful to check the official documentation too; there you will find a more complete description of each method.

But since they are so important in Spark, let's give you a brief overview of some of the most popular methods from the Column class (these methods are described in more detail in later chapters, and a short sketch using a few of them follows the list):
- desc() and asc(): methods to order the values of the column in descending or ascending order (respectively);
- cast() and astype(): methods to cast (or convert) the values of the column to a specific data type;
- alias(): method to rename a column;
- substr(): method that returns a new column with a substring of each value;
- isNull() and isNotNull(): logical methods to test whether each value in the column is a null value or not;
- startswith() and endswith(): logical methods to search for values that start with or end with a specific pattern;
- like() and rlike(): logical methods to search for a specific pattern or regular expression in the values of the column;
- isin(): logical method to test whether each value in the column is one of the listed values.
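A minimal sketch of a few of these methods in action on the sales DataFrame (the selected columns and aliases are just illustrative, not the book's own example):

from pyspark.sql.functions import col

sales\
    .select(
        col('product_name').alias('product'),
        col('price').cast('double'),
        col('product_name').startswith('Coke').alias('is_coke'),
        col('product_id').isin(110, 117).alias('is_soda')
    )\
    .orderBy(col('price').desc())\
    .show(5)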