4  Introducing the Column class

As we described in the introduction of Chapter 3, you will make heavy use of the methods from the DataFrame class in your Spark applications to manage, modify and perform calculations over your Spark DataFrames.

However, there is one more python class that provides some very useful methods that you will use regularly: the Column class or, more specifically, the pyspark.sql.column.Column class.

The Column class is used to represent a column in a Spark DataFrame. This means that each column of your Spark DataFrame is an object of class Column.

We can confirm this statement by taking the df DataFrame that we presented in Section 3.4 and looking at the class of any one of its columns, like the id column:

type(df.id)
pyspark.sql.column.Column

4.1 Building a column object

You can refer to or create a column by using the col() and column() functions from the pyspark.sql.functions module. These functions receive as input a string with the name of the column you want to create/refer to.

Their result is always an object of class Column. For example, the code below creates a column called ID:

from pyspark.sql.functions import col
id_column = col('ID')
print(id_column)
Column<'ID'>
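
For instance, the Column object produced by col() can be passed to DataFrame methods such as select(). A short illustrative sketch, assuming the df DataFrame from Section 3.4 is available in your session:

from pyspark.sql.functions import col

# Pass a Column object to select() to retrieve the `id` column of `df`
df.select(col('id')).show(5)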

4.3 Literal values versus expressions

We now know that columns of a Spark DataFrame have a deep connection with expressions. On the other hand, there are situations where you write a value (it can be a string, an integer, a boolean, or anything else) inside your pyspark code, and you actually want Spark to interpret this value as a constant (or literal) value, rather than an expression.

As an example, let's suppose you manage the data generated by the sales of five different stores, scattered across different regions of the city of Belo Horizonte (in Brazil). Now, let's suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is shown below:

path = './../Data/sales.json'
sales = spark.read.json(path)
sales.show(5)
+-----+----------+------------+-------+-------------------+-----+
|price|product_id|product_name|sale_id|          timestamp|units|
+-----+----------+------------+-------+-------------------+-----+
| 3.12|       134| Milk 1L Mua| 328711|2022-02-01T22:10:02|    1|
| 1.22|       110|  Coke 350ml| 328712|2022-02-03T11:42:09|    3|
| 4.65|       117|    Pepsi 2L| 328713|2022-02-03T14:22:15|    1|
| 1.22|       110|  Coke 350ml| 328714|2022-02-03T18:33:08|    1|
| 0.85|       341|Trident Mint| 328715|2022-02-04T15:41:36|    1|
+-----+----------+------------+-------+-------------------+-----+

If you look at this batch, there is no indication that these sales come from the 4th store. In other words, this information is not present in the data; it exists only in your mind. It is certainly a very bad idea to leave this data as is, without any identification of its source. So you might want to add labels and new columns to this batch of data that clearly identify the store that originated these sales.

For example, we could add two new columns to this sales DataFrame: one for the number that identifies the store (4), and another to hold the store address. Considering that all rows in this batch come from the 4th store, we should add two “constant” columns, meaning that these columns should have a constant value across all rows in this batch. But how can we do this? How can we create a “constant” column? The answer is: by forcing Spark to interpret the values as literal values, instead of expressions.

In other words, I cannot use the col() function to create these two new columns, because col() receives a column name as input and interprets that input as an expression that refers to an existing column. This function does not accept any kind of description of the actual values that the new column should store.
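
To make this point concrete, here is a minimal sketch (assuming the sales DataFrame loaded above) of what happens if you hand the store address to col():

from pyspark.sql.functions import col

# col() treats the string as a *reference* to a column named
# 'Amazonas Avenue, 324', which does not exist in `sales`,
# so the select below fails when Spark tries to resolve it:
try:
    sales.select(col('Amazonas Avenue, 324')).show()
except Exception as err:
    print(type(err).__name__)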

4.4 Passing a literal (or a constant) value to Spark

So how do we force Spark to interpret a value as a literal (or constant) value, rather than an expression? To do this, you must wrap the value in the lit() (short for “literal”) function from the pyspark.sql.functions module.

When you write the statement lit(4) in your code, Spark understands that you want to create a new column which is filled with 4’s; in other words, a column filled with the constant integer 4.

With the code below, I am creating two new columns (called store_number and store_address), and adding them to the sales DataFrame.

from pyspark.sql.functions import lit
store_number = lit(4).alias('store_number')
store_address = lit('Amazonas Avenue, 324').alias('store_address')

sales = sales\
    .select(
        '*', store_number, store_address
    )

sales\
    .select(
        'product_id', 'product_name',
        'store_number', 'store_address'
    )\
    .show(5)
+----------+------------+------------+--------------------+
|product_id|product_name|store_number|       store_address|
+----------+------------+------------+--------------------+
|       134| Milk 1L Mua|           4|Amazonas Avenue, 324|
|       110|  Coke 350ml|           4|Amazonas Avenue, 324|
|       117|    Pepsi 2L|           4|Amazonas Avenue, 324|
|       110|  Coke 350ml|           4|Amazonas Avenue, 324|
|       341|Trident Mint|           4|Amazonas Avenue, 324|
+----------+------------+------------+--------------------+

In essence, you normally use the lit() function when you want to write a literal value in places where Spark expects a column name or expression. In the example above, instead of writing the name of an existing column of the sales DataFrame, I wanted to write the literal values 'Amazonas Avenue, 324' and 4, and I used the lit() function to make this intention very clear to Spark. If I had not used the lit() function, the select() method would interpret the value 'Amazonas Avenue, 324' as a reference to an existing column named Amazonas Avenue, 324.
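
As a side note, the same result could be obtained with the withColumn() method (a minimal sketch, equivalent to the select() call above; sales_alt is just an illustrative name):

from pyspark.sql.functions import lit

# Equivalent approach: add the constant columns one at a time with
# withColumn(); if a column with the same name already exists, it is replaced
sales_alt = sales\
    .withColumn('store_number', lit(4))\
    .withColumn('store_address', lit('Amazonas Avenue, 324'))

sales_alt.show(5)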

4.5 Key methods of the Column class

Because many transformations that we want to apply over our DataFrames are expressed as column transformations, the methods from the Column class will be quite useful in many different contexts. You will see many of these methods across the next chapters, such as desc(), alias() and cast().

Remember, you can always use the dir() function to see the complete list of methods available in any python class. It is always useful to check the official documentation too¹, where you will find a more complete description of each method.
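
For instance (a quick illustrative sketch), you could list the public methods and attributes of the Column class like this:

from pyspark.sql import Column

# Print the public (non-underscore) methods and attributes of the Column class
print([name for name in dir(Column) if not name.startswith('_')])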

But since they are so important in Spark, let's give you a brief overview of some of the most popular methods from the Column class (these methods will be described in more detail in later chapters, and a short illustrative sketch follows the list):

  • desc() and asc(): methods to order the values of the column in a descending or ascending order (respectively);
  • cast() and astype(): methods to cast (or convert) the values of the column to a specific data type;
  • alias(): method to rename a column;
  • substr(): method that returns a new column with a substring of each value;
  • isNull() and isNotNull(): logical methods to test if each value in the column is a null value or not;
  • startswith() and endswith(): logical methods to search for values that start with or end with a specific pattern;
  • like() and rlike(): logical methods to search for a specific pattern or regular expression in the values of the column;
  • isin(): logical method to test if each value in the column is one of the listed values.
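
To give a flavor of how these methods are used in practice, here is a minimal sketch over the sales DataFrame loaded earlier in this chapter (the exact behavior of each method is covered in later chapters):

from pyspark.sql.functions import col

# A few Column methods in action over the `sales` DataFrame
sales\
    .orderBy(col('price').desc())\
    .select(
        col('product_name').alias('product'),                     # alias(): rename
        col('price').cast('string').alias('price_as_text'),       # cast(): convert type
        col('product_name').startswith('Coke').alias('is_coke'),  # startswith(): logical test
        col('product_id').isin([110, 117]).alias('is_soda')       # isin(): membership test
    )\
    .show(5)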

  1. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Column.html