12+ Useful PySpark functions for DataFrame transformations
PySpark has many useful functions for transforming and cleaning data; however, its documentation contains very few examples of how they are used. This post shows their usage with some practical examples.
1. The trim() function
The trim() function ‘trims’ spaces before and after a column’s string values. There are two variations: ltrim(), which removes spaces on the left side of the string, and rtrim(), which removes spaces on the right side.
2. The isNotNull() function
This function tells you whether a column value is not null. You can use it to filter rows or to validate values inside a when().otherwise() expression. Its counterpart, isNull(), checks the opposite condition.
3. The union() function
The union() function is very straightforward: it stacks the rows of two DataFrames, much like the UNION ALL statement in SQL (duplicates are kept), and you can use it in PySpark in almost the same way.
4. The concat() function
The concat() function allows you to create new data from the columns you already have, ‘concatenating’ them with literals or with other columns.
5. Functions to extract date info
Using year(), month(), and dayofmonth(), you can extract the year, month, and day from a date column, respectively.
6. The select() function
The select() function works much like the SELECT statement in SQL: it creates a new DataFrame containing only the fields you specify. This is particularly useful after joins and left joins, for picking out the fields you need from the table you joined with.
7. The lower() function
The lower() function converts the values of the selected column to lower case; it’s very useful for normalizing unevenly formatted data.
8. The from_utc_timestamp() function
The from_utc_timestamp() function is very useful when you need to convert a timestamp column from UTC to another time zone.
9. The regexp_replace() function
The regexp_replace() function works similarly to Python’s replace() string method, except that the pattern to match is a regular expression. You specify the column, the pattern, and the replacement text, and the function applies the replacement to every row. Because regular-expression matching is relatively expensive, it can hurt execution performance on large datasets, so use it wisely.
10. The coalesce() function
The coalesce() function is a very elegant function that returns the first non-null value from a list of columns: if the first column you pass to coalesce() is null, it tries the next one, and so on.
11. Generating ids using row_number() and rand() functions
If you need to generate sequential ids (1, 2, 3, …) over rows in random order, you can combine the row_number() and rand() functions to assign the id column values. This trick is handy for adding ids to rows whose original order doesn’t matter.
12. The createOrReplaceTempView() method and the spark.sql() function
The createOrReplaceTempView() method registers your DataFrame as a ‘queryable’ temporary view, so you can run SQL queries against it with the spark.sql() function.