Note:
I like to phrase things as punchy commands, but it's all just my preference. My personal style is constantly evolving.
My motto is a twist on the Zen of Python:
If the implementation is hard to understand, it's bad.
If the implementation is easy to understand, it may be good.
Style Guide
List comprehensions are your bread and butter. Structure them like a SQL query.
cleaned_emails = [
    email.strip()
    for email
    in emails
    if '@test.com' not in email
]
Give the NumPy docstring style a try.
def clean_emails(emails):
    """Clean up emails from source database

    Parameters
    ----------
    emails : list
        Raw email addresses

    Returns
    -------
    cleaned_emails : list
        Cleaned email addresses
    """
    cleaned_emails = [
        email.strip()
        for email
        in emails
        if '@test.com' not in email
    ]
    return cleaned_emails
Naming things is hard. Do it well! Conserve your colleagues' brainpower: if a function has a good name and docstring then they don't have to understand how it works to understand what it does.
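A hypothetical before-and-after, just to illustrate (the names and logic here are made up):

# Hard on the reader: the name says nothing, so they have to read the body.
def process(x):
    return [e.strip() for e in x if '@test.com' not in e]

# Easier: the name and docstring carry the intent.
def drop_test_emails(emails):
    """Strip whitespace and drop internal @test.com addresses."""
    return [email.strip() for email in emails if '@test.com' not in email]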
Use SQL where possible; it's more readable than the PySpark equivalent.
user_df = spark.sql("""
    SELECT id,
           created_at,
           dt AS partition
    FROM users
    WHERE updated_at IS NOT NULL
""")
Distinguish between Spark DataFrames and Pandas DataFrames when mixing the two.
user_pdf = user_df.toPandas()
Other Spark preferences: import types and functions with their conventional aliases.
import pyspark
import pyspark.sql.types as T
import pyspark.sql.functions as F
As the documentation says, the preferred entry point for Spark is through spark.
# Yes
df = spark.read.parquet()
rdd = spark.sparkContext.sequenceFile()
df = spark.sql("""SELECT * FROM users""")
# No
df = sc.read.parquet()
rdd = sc.sequenceFile()
df = sqlContext.sql("""SELECT * FROM users""")
df = hiveContext.sql("""SELECT * FROM users""")
Writing Scripts
Start with functions. If you find yourself passing several parameters between functions then make a class.
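A rough sketch of what that refactor can look like; the class and method names here are invented for illustration:

# Before: spark, database and date are threaded through every function call.
def read_users(spark, database, date): ...
def write_users(spark, database, date, users): ...

# After: the shared parameters live on the class and the methods stay small.
class EmailJob:
    def __init__(self, spark, database, date):
        self.spark = spark
        self.database = database
        self.date = date

    def read_users(self):
        return self.spark.table('{}.users'.format(self.database))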
Let's say you have a PySpark script called anonymize_emails.py which needs a date and a database.
import sys
import pyspark

if __name__ == '__main__':
    _, database, date = sys.argv
    spark = pyspark.sql.SparkSession.builder\
        .appName('Anonymize Emails for {}'.format(date))\
        .getOrCreate()
    anonymize_emails(spark, database, date)
Explicitly ignore extra parameters if you want.
_, database, date, *_ = sys.argv
I'm not a huge fan of the trailing backslash for new lines, but it's common in Spark and growing on me. For Pandas I still prefer to use parentheses. In any case, put each operation on a new line.
(df[['user_id', 'time_on_page']]
    .groupby('user_id')
    .mean()
    .to_csv('average_time_on_page.csv')
)
Jupyter Notebooks are your friend. Especially for data science you'll want to iterate: see what's in the data, check the processing did what you think it did, and see how long each chunk of code takes to run. Super helpful for development! What's a natural chunk of code? A function, of course.
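For example, IPython's %time magic is an easy way to see how long a chunk takes in a notebook cell (clean_emails is the function from earlier; raw_emails stands in for whatever you've loaded):

%time cleaned = clean_emails(raw_emails)
cleaned[:5]  # eyeball a few entries to check the processing did what you expect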
Once everything is working, clean up the notebook and add some nice comments and descriptive text. Now you're ready to transfer this code to a proper script and deploy it.