Python for Finance Quickstart

November 30, 2022
By
Finance Club
Coding for Finance

Why use programming in finance?

Have you ever wondered how you're going to apply all the fancy formulas you have learned or are going to learn in your studies?

This is the right place to start if you are unsure where to begin. And the good news is: how well you learn to apply your theoretical knowledge depends solely on you!

Generally speaking, you can either repeat the same calculations with varying numbers over and over again, or you choose the lazier but more efficient method of telling a computer how to do the job. Of course, the latter option requires a time investment in the beginning, due to the calories you have to feed your brain to systematically analyze the problem at hand. Your computer has a hard time understanding instructions from a text like this, but if you formulate a problem as a set of instructions in a language your computer can understand, you will behold true magic. You will suddenly be able to condense hours, days or even weeks of thinking into a few lines of code and execute tasks with ease in a fraction of the time you would need without a computer. And even better, during this process your programming toolkit will grow and you will eventually become more and more efficient.

Okay, but I learned Excel in my studies. That does the job.

Excel can tackle a variety of different tasks, which is why you'll come across the software in almost every office environment. But with increasingly complex analyses of bigger datasets, the Excel workflow quickly becomes less reliable and manageable. This is not to say you shouldn't use Excel at all. It is useful AND expected to know Excel in the finance industry. Use it to handle the easy tasks, but as soon as you need two or more lines of Excel formulas in a cell, you'd probably be better off handling the task in a more manageable framework, such as R or Python.

Do the easy tasks in Excel. Use Python (or R) for anything else.

What makes more sense in the Finance Industry? R or Python?

It depends! In an academic setting, R is really popular and loved by many scientists for its vast range of statistical packages. Python, however, is dominant in a business setting, as it is easy to learn and convinces with its lean yet powerful code structure. Python also has an edge over R when it comes to stability in big and complex coding projects.

Getting Python - quickstart

  1. Download Python 3.9 for your operating system.
  2. Install any IDE that you like. Suggestions from our team: PyCharm or Spyder.

At this point you should have set up your IDE and opened the program, and your Python console should greet you with a short startup message.
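As a rough illustration (the exact version number and build details depend on your installation and IDE, so the lines below are only a placeholder):

Python 3.9.x (default, ...) [build info for your platform]
Type "help", "copyright", "credits" or "license" for more information.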

Now you're ready to go! You can click the button below to access our GitHub repository with sample code showing you the very basic but essential tools you will be using all the time. If you want a guided walk-through of the code with some intuitive explanations, continue reading this post, where we dissect the code in more detail.

Start the Python for Finance Quickstart

Want a guided tour through the code?

You can use the code as-is to explore some Python basics if you already have some experience in other coding languages. However, if you are just starting with coding altogether, we offer you a guided tour through the code on how to:

  • load libraries
  • import datasets
  • perform basic operations on your data
  • use if else conditions
  • write your own functions
  • create for loops
  • calculate returns and other metrics
  • plot your results

These are the most basic, yet the most essential and most used functionalities you will need for coding in finance. The rest builds upon these and consists mostly of specialty modules and higher-level computer science topics, which are well beyond the scope of this quickstart and will be covered in later blog posts. Therefore, let's start from the ground up!

Initiating the project

Let's open a fresh .py file and just start programming. You'll learn the theory later.

Write all lines of code into the .py file by default, as opposed to writing them only into the console. That way, it is easier to keep track of what you did at the end of the process, without information being lost.

Start with a little comment to summarize what this .py file is about.

Comment your code. This will help your future self and others to understand what you did.

Seriously, if there is something worse than a poorly-written, hard-coded script, then it is uncommented code. Chances are you are going to re-use your own code a lot and that somebody else will have to read your code as well. Maybe you had a great idea how to solve a problem efficiently, or you are doing some manipulation for a later point in your code which isn't immediately obvious. If somebody reads your code (including you), they would basically have to arrive at the idea again if you had not commented it properly. This is time wasted.

Generally speaking, more comments are better than fewer.

# Purpose of this file: Rocking Python
# This is a comment. (Initiated by #)
x = 1 + 1  # On the left-hand side of the hashtag is executable code
# Everything on the right-hand side of # is a comment

Import libraries

When coding, you rely a lot on code that other people have written. What you may know as a package in R is called a library in Python. A library is a collection of code, or modules of code, that we can use in our own programs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

The above import statements load some of the most popular libraries into your programming environment, among them pandas. By using the keyword as, you can assign a short name to the library, so that when you call a function from that library, you don't have to write the whole name, but only its short name. The common convention for pandas is pd, and it is useful to adhere to this convention, because more often than not you will be recycling code from Stackoverflow or other resources where people adhere to these conventions. That way, you can just copy it into your code without needing to change anything.

Load some data

Once you have imported the libraries you need, you're ready to start importing some finance data with the pandas package (which, as we recall, we named pd), using the following statement.

raw_prices = pd.read_excel('spy_nasdaq_prices.xlsx', header=0)

In the statement above, you state that you want to use a function of the pandas package by typing pd and a dot. After the dot, you write the name of the function that you want to use, in this case the read_excel function. Within the parentheses, you specify the parameters that are needed by the function. Refer to the function definition of read_excel to see all possible arguments you can pass to this function.
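For example, read_excel also accepts optional arguments such as sheet_name, index_col or usecols. A hypothetical call (the sheet and column choices below are just for illustration, not part of the code we use later) could look like this:

# read the first sheet, use the first row as column names and the Date column as index
prices_indexed = pd.read_excel('spy_nasdaq_prices.xlsx',
                               sheet_name=0,
                               header=0,
                               index_col='Date')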

The pd.read_excel function reads an excel file into a pandas DataFrame, which is a special pandas object.

Its output looks as follows:

Out[1]:      Date         QQQ         SPY
0    2022-09-29  269.570007  360.294891
1    2022-09-28  279.940002  370.529999
2    2022-09-27  274.480011  363.380005
3    2022-09-26  274.369995  364.309998
4    2022-09-23  275.510010  367.950012        
...         ...         ...
3204 2010-01-07   40.945557   89.534698
3205 2010-01-06   40.918961   89.158302
3206 2010-01-05   41.167282   89.095604
3207 2010-01-04   41.167282   88.860374
3208 2009-12-31   40.573093   87.378456
[3209 rows x 3 columns]

We see that the dataframe consists of 3209 rows and 3 columns. Also, there is an index on the left-hand side indicating the order of the data.

Python indexing starts at zero!

You may have noticed this looking at the pandas dataframe object above!

One option to access the elements of a pandas dataframe is through the pandas indexer iloc.

Using iloc, you can select elements based on their (integer) position in the dataframe. Since Python indexing starts at zero, the first entry in the dataframe above is at location (0, 0), i.e. in the first row and in the first column.

# use index locator (iloc) of 0 (first entry) to see what first date is
first_date = raw_prices.iloc[0,0]
first_date
Out[22]: Timestamp('2022-09-29 00:00:00')

Generally, you can access the n-th last element of a list by the integer -n. To get the element of the dataframe above in the last row and in the first column, we can therefore use the following command.

Access the last date:

# use index locator (iloc) of -1 (last entry) to see what last date is
last_date = raw_prices.iloc[-1,0]
last_date
Out[27]: Timestamp('2009-12-31 00:00:00')

if: else:

Let's do a quick check whether our first date is smaller, i.e. older, than our last date. If the statement following the word "if" is True, then the following lines will be executed. In our case, a message appears saying that the data is in the correct order. If the statement is False, the lines after "else" are executed.

# make a quick check whether data is in the right order
if first_date < last_date:
    print('Data is in the correct order')
else:
    print('Data is in the incorrect order, needs to be flipped!')

Important note Nr. 1: In Python, indentation is extremely important, in the sense that your code won't work properly if you don't indent. Indentation is the blank space created by pressing "Tab" (or decreased by "Shift+Tab") in the new line after certain commands. This list isn't exhaustive, but you will most often need to use indentation with:

  • "if" statements such as the one below
  • "for" and "while" loops
  • function definitions

Most languages end these statements with the keyword "end"/"end if" (Matlab, VBA) or by wrapping them in curly brackets {} (R). In Python, the end of the indentation signifies the end of the given statement. It might need a little getting used to, but ultimately it creates cleaner-looking code (fewer "end" statements and brackets) and also forces you to indent and create a natural structure in your code, which is good practice anyway.
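As a small illustration of how dedenting closes a block (the numbers below are arbitrary and only serve as an example):

for price in [100, 95, 102]:
    if price < 100:
        print('below 100')      # runs only when the if-condition is True
    print('checked one price')  # still indented: runs once per loop iteration
print('done')                   # dedented: runs once, after the loop has finished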

Writing a function

Functions are an extremely versatile tool in a coder's repertoire, and they allow you to make your code more compact, efficient and modular. There are already thousands and thousands of pre-made functions, from simple sums and divisions to complicated machine learning model compilers.

Despite this massive selection, you will certainly have a very specific problem which you can solve in a few lines of code. However, let's say you need to repeat the operation more than once - it would be inefficient to write the same lines of code over and over to do the operation. Moreover, it is quite error-prone. Lastly, the code becomes unnecessarily cluttered and lengthier than needed.

Let's take our previous example of checking whether the dataframe index is in the correct order and, if it is not, flipping it. We start writing a function with "def name_of_function(...):". Inside the brackets, we write our "inputs", i.e. the arguments which the function will accept and use.

We named the function "check_flip", and we will be calling the function using this name. The function's job will be to check whether the dataframe is in the correct order and, if not, to flip it. The input argument is a dataframe, which will be called "df" inside the function.

def check_flip(df):
    # note: assumes index is in the first (0th) column of dataframe
    first_date = df.iloc[0,0]
    last_date = df.iloc[-1,0]

Now we indent to signify what will happen inside the function. The input dataframe "df" will be read, and as we did before, we take the first element in the first column with .iloc (we know that dates are saved in column 0) and save it as a variable "first_date". Note: this variable stays only within this function unless returned.

We do the same for the last element of the dataframe's first column, saving it as "last_date".

def check_flip(df):
    # note: assumes index is in the first (0th) column of dataframe
    first_date = df.iloc[0,0]
    last_date = df.iloc[-1,0]

    if first_date < last_date:
        print('Data is in the correct order')
        df_to_return = df.copy()
    else:
        print('Data is in the incorrect order, returning flipped df')
        df_to_return = df.iloc[::-1]
    return df_to_return

Now we finish the function and do the check we did before. If the condition is met that first_date is smaller, i.e. older, than last_date, a statement is printed and the original input "df" is copied into a new variable "df_to_return", which, as the name suggests, will be returned. As the order is correct, nothing needs to be changed or modified.

On the other hand, if the condition is not met, we print a statement that the order is not correct and needs to be flipped. Moreover, we use the same .iloc logic to flip the order of the input "df". This flipped dataframe is also saved into "df_to_return". Note that the if/else statement is exhaustive, meaning only one of the two cases can be met, and "df_to_return" will be assigned only once and thus will not be overwritten.

The last important statement in the function is "return" - it says which of the variables inside the function shall be returned. In our case, it is the final dataframe, either already correct or flipped.

Now our function is ready! To be able to call and use the function later, don't forget to execute the definition above first. We then call the function as follows:

raw_prices_flipped = check_flip(raw_prices)
Data is in the incorrect order, returning flipped df

Let's examine a little bit what's happening here. On the right-hand side of the assignment, we call "check_flip", which is the name of our function. Inside the brackets, there is the dataframe "raw_prices" which we loaded before; it is the input of our function - in the function definition, it is the "df" variable. The result is saved into the new variable "raw_prices_flipped" on the left-hand side.

Some basic manipulation

We are now going to manipulate our freshly loaded dataset so that we can actually work with it.

When we initially loaded our dataset with the command,

raw_prices = pd.read_excel('spy_nasdaq_prices.xlsx', header=0)

we didn't specify what our index column should be. In the case of financial time series data, we want the date to be the index of our dataframe. We can either pass the index_col argument to pd.read_excel directly or set the index afterwards with

df_prices = raw_prices_flipped.set_index('Date')

We can then easily subset our data by certain timeframes. Suppose we want to use the 2nd of February 2012 as the start date of the dataframe:

new_start = '2012-02-02'

Using the loc indexer, we can subset the DataFrame from the new_start date until the last available observation with the following command:

df_prices_cut = df_prices.loc[new_start:, :]

Let's calculate the returns and print a basic summary.

# calculate the returns and drop the first NA observation
df_returns = df_prices_cut.pct_change(1).dropna()
# explore the returns data
df_returns.describe()

QQQ          SPY
count  2682.000000  2682.000000
mean      0.000674     0.000505
std       0.012843     0.010610
min      -0.119788    -0.109424
25%      -0.004419    -0.003501
50%       0.001146     0.000632
75%       0.007065     0.005446
max       0.084706     0.090603

Plotting a line chart

We rely on the matplotlib library to plot any type of graph in Python. Find a beginner cheatsheet here. We import what we need and plot the cumulative returns with the following statements:


import matplotlib.pyplot as plt

# Assuming `df_returns` is already defined
(df_returns + 1).cumprod().plot()
plt.title('NAV of SPY and QQQ indices')
plt.ylabel('NAVs')
plt.show()

for-Loops

For-loops are one of the most basic and most important concepts in programming. They are used for iterating over a sequence. This can be a list, a tuple, a dictionary, a set, or a string. Using a for-loop, we can execute a set of statements once for each item in a sequence. We illustrate the use of for-loops by computing Moving Averages (MA). The most basic form of MA is the simple MA (SMA), which is calculated by taking the arithmetic mean of a set of prices over a specified period n.
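Before we apply this to moving averages, here is a minimal for-loop over a plain list (the ticker names below are just placeholders for illustration):

tickers = ['SPY', 'QQQ']
for ticker in tickers:
    # the indented line runs once for every item in the list
    print('Processing ' + ticker)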

Let's investigate the moving-average calculation in detail. We want to calculate the SMA series of the prices in our dataset df_prices_cut. We choose the number of time periods n to be 120. It is clear that we will lose the first 120 observations, since the first SMA value in the SMA series is the arithmetic average of 120 sequential values. After calculating the first SMA value, we need to repeat the same task, moving one row (one day) ahead. We use a for-loop for this:

# parameters
MA_days = 120  # number of time periods
# create an empty dataframe to store the results
df_MAs = pd.DataFrame(index=df_prices_cut.index, columns=df_prices_cut.keys() + '_MA')  # you don't need to understand this
# Create a for loop to calculate and store MA
for i in range(MA_days, len(df_MAs)):
    # begin writing in position 121 (python index = 120), taking the mean of prices from the past 120 days, and go forward
    df_MAs.iloc[i, :] = df_prices_cut.iloc[i-MA_days:i, :].mean(axis=0)

Congrats! We have written our first for-loop. Let's elaborate in detail on what this loop is doing.

With

for i in range(MA_days, len(df_MAs)):

we tell Python to repeat the task for every value i in the sequence of days starting at 120 and ending at the last observation in the dataset df_prices_cut.

By indenting the next line, we tell Python that this line contains the task to be repeated. On the LHS of the equation, we instruct our computer to fill the i-th row of df_MAs with the SMA value calculated on the RHS. To calculate the average over the first 120 days, we first have to specify what exactly the first 120 observations are. We want to start at the first observation (Python index = 0) and stop just before position i, which is exactly what the following slice does:

df_prices_cut.iloc[i-MA_days:i,:]

For the first iteration of the for-loop, this is equal to:

df_prices_cut.iloc[0:120,:] # .iloc[row index or range, col index or range]

Attention! This selects all 120 values at integer positions 0 to 119! This is Python-specific: when specifying a range of numbers 0:n, the endpoint n is excluded, so the last selected element is at position n-1.
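A quick way to convince yourself of this, assuming df_prices_cut is loaded as above:

# the slice 0:120 contains exactly 120 rows, at integer positions 0 through 119
window = df_prices_cut.iloc[0:120, :]
print(len(window))  # 120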

However, the first observation in our SMA dataset is

df_MAs.iloc[120, : ]

and represents the 121st observation of df_prices_cut.

Dictionaries

Dictionaries are used to store data values in key:value pairs. Put very simply, you can think of them as lists where individual elements have names. Why is that even useful, you may ask. Imagine you have a vector of daily weights, determined by some rule, for some stock $ABC; over 252 trading days it has length 252, so by definition it is a one-dimensional object. Now two new stocks, $XYZ and $JKL, come into play with the weights you determined for them - so you merge, or stack, these three vectors to create a matrix, a two-dimensional object. In practice, that would most likely be a dataframe. We can multiply such a dataframe with a returns dataframe to obtain the observed portfolio returns.
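As a loose sketch of that idea (the tickers, the equal weights and the random returns below are made up purely for illustration):

import numpy as np
import pandas as pd

tickers = ['ABC', 'XYZ', 'JKL']
# hypothetical daily weights for three stocks over 252 trading days (equal-weighted here)
df_weights = pd.DataFrame(np.full((252, 3), 1/3), columns=tickers)
# hypothetical daily returns for the same stocks and days
df_stock_returns = pd.DataFrame(np.random.normal(0, 0.01, size=(252, 3)), columns=tickers)
# multiply element-wise and sum across stocks to get the daily portfolio returns
portfolio_returns = (df_weights * df_stock_returns).sum(axis=1)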

Let us now say you want to keep these same three stocks, but you came up with many more rules for how to determine their weights - where do you store them?! You have several options:

  • create a new dataframe with the three stocks for each option manually. This might fly if you do it twice, but if you have 3, 10 or 100 options, that is simply unfeasible
  • keep putting the weights into the original dataframe, expanding the "n" dimension of the dataframe. That might work, but it is extremely messy and error-prone
  • as before, we add a dimension and create a 3-dimensional array, where every increment of the "z" dimension is a separate set of weights of the three stocks. That is already close to what we are looking for, but humans generally have trouble working in three dimensions

So what now? Well, we can first utilise a list. A list is basically a collection of other elements, whatever they are: dataframes, numbers, strings or other lists. It's like a wardrobe, and each drawer of that wardrobe can store some things - socks, shirts, but also other wardrobes! That way, you can create tree-like structures for storing data.
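A small sketch of that flexibility (df_weights is the hypothetical weights dataframe from the sketch above):

# a list can hold objects of completely different types, including other lists
my_list = [df_weights, 42, 'some text', [1, 2, 3]]
my_list[0]  # access the first element (here the dataframe) by its integer position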

Back to our example - let's say you have the three stocks in a dataframe. Now you create a hundred rules on how to determine those weights, so you are left with 100 individual dataframes. If one dataframe were a sheet of paper, then putting them all into a list would be like a stack of papers you put into your printer. That is already quite nice; the only problem is accessing the individual dataframes. In the case of our stack of paper, you would have to remember where each sheet of paper is in order to get the right one.

And that's where dictionaries come into play - just like in a book, you not only stack the pages together, but you can also name the individual chapters so that you can easily find them! The inherent advantage of this property is that you can also name entries dynamically in a for-loop.

Let's dig into one example here:

# create empty dictionary
dict_MA = dict()
# create a list of all lookbacks we want to calculate
l_lookbacks = [60, 120, 240]
for lookback in l_lookbacks:
    print(lookback)
    # we can save any object into a dictionary while naming it dynamically in the loop
    dict_MA['MA_' + str(lookback) + '_days'] = df_prices_cut.rolling(lookback).mean()

# The dictionary dict_MA conceptually looks as follows:
dict_MA = {'MA_60_days': df_prices_cut.rolling(60).mean(),
           'MA_120_days': df_prices_cut.rolling(120).mean(),
           'MA_240_days': df_prices_cut.rolling(240).mean()}

dict_MA
Out[43]: 
{'MA_60_days':                    
QQQ         SPY 
Date                               
2012-02-02         NaN         NaN 
2012-02-03         NaN         NaN 
2012-02-06         NaN         NaN 
2012-02-07         NaN         NaN 
2012-02-08         NaN         NaN 
...                ...         ... 
2022-09-23  303.293952  398.107919 
2022-09-26  303.203831  397.917974 
2022-09-27  303.084767  397.646302 
2022-09-28  302.976346  397.481846 
2022-09-29  302.664481  397.125392  
[2683 rows x 2 columns], 
'MA_120_days':                    
QQQ         SPY 
Date                               
2012-02-02         NaN         NaN 
2012-02-03         NaN         NaN 
2012-02-06         NaN         NaN 
2012-02-07         NaN         NaN 
2012-02-08         NaN         NaN 
...                ...         ... 
2022-09-23  306.086751  402.188941 
2022-09-26  305.307082  401.450070 
2022-09-27  304.596410  400.751130 
2022-09-28  303.996329  400.149042 
2022-09-29  303.302856  399.443069  
[2683 rows x 2 columns], 
'MA_240_days':                    
QQQ         SPY 
Date                               
2012-02-02         NaN         NaN 
2012-02-03         NaN         NaN 
2012-02-06         NaN         NaN 
2012-02-07         NaN         NaN 
2012-02-08         NaN         NaN 
...                ...         ... 
2022-09-23  337.780538  424.896295 
2022-09-26  337.444902  424.634435 
2022-09-27  337.097881  424.362298 
2022-09-28  336.746157  424.089907 
2022-09-29  336.341659  423.761037  
[2683 rows x 2 columns]}

This dictionary allows us to select different specifications of the same calculation. By typing

dict_MA['MA_240_days']

where 'MA_240_days' is a key of the dictionary, we get the corresponding value, which in this case is a pandas dataframe.
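Because every dataframe now has a name, we can also loop over the whole dictionary at once, for example to print the latest moving-average values for each lookback (a small sketch, assuming dict_MA was built as above):

for key, df_ma in dict_MA.items():
    # key is e.g. 'MA_60_days', df_ma is the corresponding dataframe of moving averages
    print(key, df_ma.iloc[-1].round(2).to_dict())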

Additional resources

Coding is one of the areas where people like sharing their work and helping others online. This is a valuable trait, as it makes learning and exploring new ideas incomparably easier and more fun. In fact, the amount of open-source code, guides, tutorials etc. published for free on the internet is so vast that you can become an adept coder using only these resources. In the following, you will find a list of resources we personally use daily and find exceptionally useful.

Stackoverflow

Stackoverflow is an absolutely essential resource no coder works without. It is a forum where you can post your coding-related questions, which are then answered with tailor-made solutions by fellow forum members. It is incredibly helpful because:

  • unless you are working on frontier-level code, AI or the like, somebody has had your exact problem 9 out of 10 times in the past
  • it is completely free and has a truly broad community, so if you have a problem you cannot solve, you can just post there (mind the rules of Stackoverflow!)

Sometimes the hardest part of finding the right answer is formulating the question itself; you might be looking for "how to index/slice in Python" without knowing that it is called indexing and slicing. That takes a little practice as well, but the point is - if you don't already have one, create a free account and be active! This will earn you points and make your profile more credible, which might lead to quicker answers to your problems.

GitHub

Github is an outstanding portal used for file sharing and version control, not only when working in a team but also alone. Think of Google Docs, where you can simultaneously work on a file with your colleagues - but for coding. It is of course much deeper than that with all its functionalities, but to summarise, it allows you to:

  • work on a project simultaneously and efficiently share changes of your code
  • track and view changes, return to a previous version if needed
  • work on several "forks" of the same file, which can be used as different verison of a code or can be later merged into the final one
  • maintain general structure, organization and backups of your files

Many coders share their projects freely on Github as well (as we do with projects published here), so that you may benefit by simply pulling (downloading) them and trying them out. Moreover, it can serve as your project portfolio to show during interviews for your future Quant-Finance job :-)

Lastly, Github also has a blog/troubleshooting functionality, thus, in addition to Stackoverflow, you might sometimes find answers to your problems there.

Medium / Towards Data Science

Medium is an online blog platform with a subsection called "Towards Data Science", where you can find posts from fellow coders, data scientists and enthusiasts. The nice part is that, more often than not, you will find complete code in articles that walk through some project on the topic of finance, statistics, machine learning or general data analysis: Example of PCA applied to stock data.

There is a monthly limit, though, and some articles are premium-only, but in our opinion it is absolutely worth the $5 a month (Disclaimer: we are not affiliated with Medium.com in any way).

O'Reilly Books

O'Reilly is a book publisher with a focus on programming languages. They have a great collection to choose from, and each book contains hundreds of lines of code with applied examples, such as this one specifically written about Python for Finance.

Package documentation

Package documentation is a tremendous, albeit somewhat more complex, resource for coding, especially for beginners. In a package's documentation, you will find:

  • all the functions included in the package and their usage
  • what arguments they accept and what data types have to be supplied
  • simple usage examples

Example: let's say you have a pandas dataframe including only values of 0 and 1. You would like to flip those, i.e. 0's should become 1's and 1's should become 0's. You know that there is a function .replace() in the pandas package, but you are unsure how to tell the function what to replace with what.

import pandas as pd
df = pd.DataFrame([0,0,1,0], index= ['A', 'B','C','D'], columns = ['Some values'])
print(df) 

Some values
A            0
B            0
C            1
D            0

You head over to the pandas package documentation and find that there are many ways to do this - one of them is supplying a dictionary, like so:

df.replace({0:1,1:0})   

Some values
A            1
B            1
C            0
D            1

And voilà! We have flipped 0's and 1's.