12+ Useful PySpark functions for DataFrame transformations

Julian Alexander Murillo
3 min read · Feb 19, 2019

PySpark has a lot of useful functions for transforming and cleaning data; however, its documentation contains very few examples of what these functions look like in practice. This post shows their usage with some useful examples.

1. The trim() function

The trim() function ‘trims’ spaces before and after the column’s string values. There are two variations of this function: ltrim(), which removes spaces on the left side of the string, and rtrim(), which removes spaces on the right side.

trim() function in action
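Here’s a minimal sketch of how that might look (the sample data and column names are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim, col

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("  Alice  ",), ("  Bob",)], ["name"])

# trim() strips spaces on both sides; ltrim()/rtrim() strip only one side
df.select(
    trim(col("name")).alias("trimmed"),
    ltrim(col("name")).alias("left_trimmed"),
    rtrim(col("name")).alias("right_trimmed"),
).show()
```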

2. The isNotNull() function

This function tells you whether a column value is null. You can use it to filter rows or to validate values with the when().otherwise() construct, and there is a complementary variation of this function named isNull().

.isNotNull() function in action
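A small sketch, reusing the spark session from the first example with made-up data:

```python
from pyspark.sql.functions import col, when

df = spark.createDataFrame([("Alice", 34), ("Bob", None)], ["name", "age"])

# Keep only the rows where age is present
df.filter(col("age").isNotNull()).show()

# Or replace nulls with a default value using when().otherwise()
df.withColumn("age", when(col("age").isNull(), 0).otherwise(col("age"))).show()
```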

3. The union() function

The union() function is very straightforward: it does the same thing the ‘union’ statement in SQL does, and you can use it in PySpark in almost the same way. One caveat: it matches columns by position and keeps duplicate rows, so it actually behaves like SQL UNION ALL; chain .distinct() if you need deduplication.

union() function in action
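A quick sketch with two invented DataFrames (same spark session as above):

```python
df1 = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df2 = spark.createDataFrame([(2, "Bob"), (1, "Alice")], ["id", "name"])

# union() matches columns by position and keeps duplicates,
# so it behaves like SQL UNION ALL
df1.union(df2).show()

# Chain .distinct() to get the deduplicating behavior of SQL UNION
df1.union(df2).distinct().show()
```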

4. The concat() function

The concat() function lets you create new data from the columns you already have by ‘concatenating’ them with literals or other columns.

concat() function in action
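For example, building a full name from two columns and a literal space (the sample names are made up):

```python
from pyspark.sql.functions import concat, lit, col

df = spark.createDataFrame([("Alice", "Smith")], ["first_name", "last_name"])

# Concatenate two columns with a literal separator in between
df.withColumn(
    "full_name", concat(col("first_name"), lit(" "), col("last_name"))
).show()

# Note: concat() returns null if any input is null;
# concat_ws(" ", ...) is a null-safe alternative that takes a separator
```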

5. Functions to extract date info

Using year(), month() and dayofmonth() you can extract the year, the month, and the day of the month from a date column, respectively.

year(), month() and dayofmonth() functions in action
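A minimal sketch with a single invented date:

```python
import datetime
from pyspark.sql.functions import year, month, dayofmonth, col

df = spark.createDataFrame([(datetime.date(2019, 2, 19),)], ["birth_date"])

df.select(
    year(col("birth_date")).alias("year"),
    month(col("birth_date")).alias("month"),
    dayofmonth(col("birth_date")).alias("day"),
).show()
```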

6. The select() function

Just as the ‘select’ statement works in SQL, you can use the select() function in PySpark to create a new DataFrame with only the fields you specify. This is particularly useful after ‘joins’ and ‘left joins’ for keeping just the fields you need from the table you are joining with.

select() function in action
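A sketch of both uses, with invented users and orders tables:

```python
users = spark.createDataFrame([(1, "Alice")], ["id", "name"])
orders = spark.createDataFrame([(1, 100.0)], ["user_id", "total"])

# Keep only the columns you care about
users.select("name").show()

# After a left join, select() picks the fields you want from each side
users.join(orders, users["id"] == orders["user_id"], "left") \
     .select(users["name"], orders["total"]) \
     .show()
```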

7. The lower() function

The lower() function converts the values of the selected column to lower case; it’s very useful for normalizing unevenly formatted data.

lower() function in action
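For instance (sample values invented):

```python
from pyspark.sql.functions import lower, col

df = spark.createDataFrame([("ALICE",), ("bOb",)], ["name"])

# Normalize unevenly cased values to lower case
df.withColumn("name", lower(col("name"))).show()
```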

8. The from_utc_timestamp() function

The from_utc_timestamp() function is very useful when you need to convert a timestamp column from UTC to another time zone.

from_utc_timestamp() in action
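A sketch converting an invented UTC timestamp to Bogota time:

```python
import datetime
from pyspark.sql.functions import from_utc_timestamp, col

df = spark.createDataFrame(
    [(datetime.datetime(2019, 2, 19, 12, 0, 0),)], ["utc_ts"]
)

# Interpret utc_ts as UTC and render it in the America/Bogota time zone
df.withColumn(
    "local_ts", from_utc_timestamp(col("utc_ts"), "America/Bogota")
).show(truncate=False)
```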

9. The regexp_replace() function

The regexp_replace() function works similarly to the replace() function in Python, except that the search pattern is a regular expression. To use it you specify the column, the pattern to be matched, and the replacement text. The pattern is applied to every row, so on large DataFrames this function can be expensive and hurt execution performance; use it wisely.

regexp_replace() function in action
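For example, stripping every non-digit character from an invented phone column:

```python
from pyspark.sql.functions import regexp_replace, col

df = spark.createDataFrame([("555-123-4567",)], ["phone"])

# The second argument is a regular expression, not a plain string
df.withColumn("phone", regexp_replace(col("phone"), r"[^0-9]", "")).show()
```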

10. The coalesce() function

The coalesce() function is a very elegant function that returns the first non-null value from a list of columns: if the first column you pass to coalesce() is null, it moves on to the next one, and so on.

coalesce() function in action
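A sketch with made-up email columns; note that this is the column function coalesce(), not the DataFrame.coalesce() method that changes the number of partitions:

```python
from pyspark.sql.functions import coalesce, col, lit

df = spark.createDataFrame(
    [("a@mail.com", None), (None, "b@backup.com"), (None, None)],
    ["email", "backup_email"],
)

# Take the first non-null value, falling back to a literal default
df.withColumn(
    "contact", coalesce(col("email"), col("backup_email"), lit("unknown"))
).show()
```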

11. Generating ids using row_number() and rand() functions

If you need to generate sequential ids (1, 2, 3…) for rows in random order, you can use the row_number() and rand() functions together to assign your id column values. This trick is handy for adding ids to rows when the ids don’t need to be stable across runs.

row_number() and rand() functions in action
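A sketch with a tiny invented DataFrame:

```python
from pyspark.sql.functions import row_number, rand
from pyspark.sql.window import Window

df = spark.createDataFrame([("Alice",), ("Bob",), ("Carol",)], ["name"])

# Order the rows randomly, then number them 1, 2, 3...
# A window without partitionBy() moves all rows to a single partition,
# so this trick is best suited to small or medium DataFrames.
df.withColumn("id", row_number().over(Window.orderBy(rand()))).show()
```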

12. The createOrReplaceTempView() method and the spark.sql() function

The createOrReplaceTempView() method lets you take your DataFrame and turn it into a ‘queryable’ view; this way you can run SQL queries on it using the spark.sql() function.

createOrReplaceTempView() method and spark.sql() function in action
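A minimal sketch, again with invented data:

```python
df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])

# Register the DataFrame as a temporary view, then query it with SQL
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
```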
