Have you ever wondered how you're going to apply all the fancy formulas you have learned, or are going to learn, in your studies?
This is the right place to start, if you are unsure where to begin. And the good news is: how well you learn to apply your theoretical knowledge depends solely on you!
Generally speaking, you can either repeat the same calculations with varying numbers over and over again, or you can choose the lazier but more efficient method of telling a computer how to do the job. Of course, the latter option requires a time investment in the beginning, due to the calories you have to feed your brain to systematically analyze the problem at hand. Your computer has a hard time understanding instructions from a text like this, but if you formulate a problem as a set of instructions in a language your computer can understand, you will behold true magic. You will suddenly be able to condense hours, days or even weeks of thinking into a few lines of code and execute tasks with ease in a fraction of the time you would need without a computer. Even better, your programming toolkit will grow during this process and you will become more and more efficient.
Excel can tackle a variety of different tasks, which is why you'll come across the software in almost every office environment. But with increasingly complex analyses of bigger datasets, the Excel workflow quickly becomes less reliable and manageable. This is not to say you shouldn't use Excel at all. Knowing Excel is both useful and expected in the finance industry. Use it to handle the easy tasks, but as soon as you need two or more lines of Excel formulas in a cell, you'd probably be better off handling the task in a more manageable framework, such as R or Python.
Do the easy tasks in Excel. Use Python (or R) for anything else.
It depends! In an academic setting, R is really popular and loved by many scientists for its vast range of statistical packages. Python, however, is dominant in a business setting, as it is easy to learn and convinces with its lean yet powerful code structure. Python also has an edge over R when it comes to stability in big and complex coding projects.
At this point you should have set up your IDE and opened the program. Your Python console should now display something similar to the following message:
Now you're ready to go! You can click the button below to access our GitHub repository with sample code showing you the very basic, but essential tools you will be using all the time. If you want a guided walk-through of the code with some intuitive explanations, continue reading this post, where we dissect the code in more detail.
Start the Python for Finance Quickstart
You can use the code as-is to explore some Python basics if you already have some experience in other coding languages. However, if you are just starting out with coding altogether, we offer you a guided tour through the code on how to:
These are the most basic, yet the most essential and most used functionalities you will need for coding in finance. The rest builds upon these and consists mostly of specialty modules and high-level computer science topics, which are well beyond the scope of this quick-start and will be covered in later blog posts. Therefore, let's start from the ground up!
Let's open a fresh .py file and just start programming. You'll learn the theory later.
Write all lines of code into the .py file by default, as opposed to writing them only into the console. That way, it is easier to keep track of what you did at the end of the process, and no information gets lost.
Start with a short comment that summarizes what this .py file is about.
Seriously, if there is something worse than a poorly written, hard-coded script, it is uncommented code. Chances are you are going to re-use your own code a lot and that somebody else will have to read your code as well. Maybe you have had a great idea how to solve a problem efficiently, or you are doing some manipulation for a later point in your code which isn't immediately obvious. If somebody reads your code (including you), they would basically have to arrive at the idea again if you had not commented it properly. This is time wasted.
Generally speaking, more comments are better than fewer.
Import libraries
When coding, you rely a lot on code that other people have written. What you may know as a package from R is called a library in Python. A library is a collection of code, or of modules of code, that we can use in our own programs.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
The import statements above bring three popular libraries into your programming environment: numpy, pandas and matplotlib. With the keyword as, you can assign a short name to a library, so that when you call a function from that library, you don't have to write the whole name, only its short name. The common convention for pandas is pd, and it is useful to adhere to it because, more often than not, you will be recycling code from Stackoverflow or other resources where people follow these conventions. That way, you can just copy it into your code without needing to change anything.
Once you have imported the libraries you need, you're ready to start importing some finance data with the pandas package (as we recall, named pd) using the following statement.
raw_prices = pd.read_excel('spy_nasdaq_prices.xlsx', header=0)
In the statement above, you state that you want to use a function of the pandas package by typing pd and a dot. After the dot, you write the name of the function that you want to use, in this case the read_excel function. Within the parentheses, you specify the parameters that the function needs. Refer to the function definition of read_excel to see all possible arguments you can pass to this function.
The pd.read_excel function reads an Excel file into a pandas DataFrame, which is a special pandas object.
Its output looks as follows:
We see that the dataframe consists of 3209 rows and 3 columns. Also, there is an index on the left-hand side indicating the order of the data.
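Since the Excel file itself is not reproduced here, the following minimal sketch uses a small stand-in dataframe to show how you could inspect the number of rows and columns and the index. The column names and all values are assumptions made up for illustration, not the actual content of spy_nasdaq_prices.xlsx.

```python
import pandas as pd

# Hypothetical stand-in for the Excel data: one date column and two price columns.
# Column names and values are invented for illustration only.
raw_prices = pd.DataFrame({
    "Date": pd.date_range("2012-01-02", periods=5, freq="B"),
    "SPY": [127.5, 127.7, 128.0, 127.7, 128.0],
    "NASDAQ": [2648.7, 2648.4, 2669.9, 2674.2, 2676.6],
})

n_rows, n_cols = raw_prices.shape   # .shape is a (rows, columns) tuple
print(raw_prices.head())            # first rows, with the integer index on the left
print(n_rows, n_cols)
```

With the real file, the same two calls would report the 3209 rows and 3 columns mentioned above.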
You may have noticed this looking at the pandas dataframe object above!
One option to access the elements of a pandas dataframe is through the pandas function iloc.
Using iloc, you can select elements based on their (integer) position in the dataframe. Since Python indexing starts at zero, the first data point in the dataframe above is at location (0, 0), i.e. in the first row and in the first column.
Generally, you can access the n-th last element of a list with the integer -n. To get the element in the last row and in the first column of the dataframe above, we can therefore use the following command.
Access the last date:
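As a minimal sketch of both lookups, the example below reads the first and the last date with iloc. The small dataframe is a made-up stand-in for raw_prices, assuming (as the text does later) that the dates sit in column 0.

```python
import pandas as pd

# Hypothetical stand-in for raw_prices; dates are assumed to be in column 0.
raw_prices = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-03", "2012-01-04", "2012-01-05"]),
    "SPY": [127.5, 127.7, 128.0],
})

first_date = raw_prices.iloc[0, 0]    # first row, first column
last_date = raw_prices.iloc[-1, 0]    # last row (index -1), first column
print(first_date, last_date)
```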
Let's do a quick check whether our first date is smaller, i.e. older, than our last date. If the statement following the word "if" is True, then the indented lines below it will be executed. In our case, a message appears saying that the data is in the correct order. If the statement is False, the lines after "else" are executed instead.
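The check could look roughly like the sketch below. The two dates are hard-coded assumptions so that the snippet runs on its own; in the post they come from the dataframe. The message variable is only there to make the outcome easy to inspect.

```python
import pandas as pd

# Assumed example dates; in practice these come from the dataframe via iloc.
first_date = pd.Timestamp("2012-01-03")
last_date = pd.Timestamp("2012-01-05")

if first_date < last_date:
    # Indented lines belong to the "if" branch.
    message = "Data is in correct order."
else:
    # Indented lines belong to the "else" branch.
    message = "Data is in an incorrect order, needs to be flipped!"

print(message)
```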
Important note Nr.1: In Python, indentation is extremely important, in the sense that your code won't work properly if you don't indent. Indentation is the blank space created by pressing "Tab" (or decreased by "Shift+Tab") in the new line after certain commands. This list isn't exhaustive, but you will most often need to use indentations with:
Most languages end these statements with the word "end" or "end if" (Matlab, VBA) or by wrapping them in curly brackets {} (R). In Python, the end of the indentation signifies the end of the given statement. It might take a little getting used to, but ultimately it creates cleaner-looking code (fewer "end" statements and brackets) and also forces you to indent and give your code a natural structure, which is good practice anyway.
Functions are an extremely versatile tool in a coder's repertoire and they allow you to make your code more compact, efficient and modular. There are already thousands and thousands of pre-made functions, from simple sums and divisions to complicated machine learning model compilers.
Despite this massive selection, you will certainly have a very specific problem which you can solve in a few lines of code. However, let's say you need to repeat the operation more than once - it would be inefficient to write the same lines of code over and over to do the operation. Moreover, it is quite error-prone. Lastly, the code becomes unnecessarily cluttered and lengthier than needed.
Let's take our previous example of checking whether the dataframe index is in the correct order and, if it is not, flipping it. We start writing a function with the keyword "def", followed by the function name and a pair of brackets: "def name(...):". Inside the brackets, we write our "inputs", i.e. the arguments which the function will accept and use.
We named the function "check_flip", and we will call the function using this name. The function's job is to check whether the dataframe is in the correct order and, if not, to flip it. The input argument is a dataframe, which will be called "df" inside the function.
Now we indent to signify what will happen inside the function. The input dataframe "df" will be read and as we did before, we read the first element in the first column with .iloc (we know that dates are saved in column 0) and save it as a variable "first_date". Note: this variable stays only within this function unless exported.
We do the same for the last element of the dataframe's first column, saving it as "last_date".
Now we finish the function with the check we did before. If the first date is smaller, i.e. older, than last_date, a statement is printed and the original input "df" is copied into a new variable "df_to_return", which, as the name suggests, will be returned. As the order is correct, nothing needs to be changed or modified.
On the other hand, if the condition is not met, we print a statement that the order is not correct and needs to be flipped. We then flip the order of the input "df" and save the flipped dataframe into "df_to_return" as well. Note that exactly one of the two branches of the if/else statement is executed, so "df_to_return" is assigned only once and is never overwritten.
The last important statement in the function is "return": it says which of the variables inside the function shall be returned. In our case, it is the dataframe that is either already in the correct order or has been flipped.
Now our function is ready! To be able to call and use the function later, don't forget to execute these lines as follows:
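Putting the steps above together, the function might look like the sketch below. The original code is not reproduced in this text, so the way the dataframe is flipped (reversing the rows with iloc[::-1] and resetting the index) and the example data are assumptions; the function name, argument name, variable names and the overall logic follow the description.

```python
import pandas as pd

def check_flip(df):
    """Return df in chronological order, flipping the rows if necessary."""
    first_date = df.iloc[0, 0]    # dates are assumed to be in column 0
    last_date = df.iloc[-1, 0]

    if first_date < last_date:
        print("Data is in correct order, returning original df")
        df_to_return = df.copy()
    else:
        print("Data is in an incorrect order, returning flipped df")
        # Assumption: flip by reversing the row order and renumbering the index.
        df_to_return = df.iloc[::-1].reset_index(drop=True)

    return df_to_return

# Hypothetical data stored newest-first, to exercise the "flip" branch.
reversed_prices = pd.DataFrame({
    "Date": pd.to_datetime(["2012-01-05", "2012-01-04", "2012-01-03"]),
    "SPY": [128.0, 127.7, 127.5],
})
flipped = check_flip(reversed_prices)
print(flipped)
```

Calling the function a second time on the result takes the first branch and returns the data unchanged.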
prices_flipped_from_function = check_flip(raw_prices)
'Data is in an incorrect order, needs to be flipped!'
'Data is in an incorrect order, returning flipped df'
Let's examine what's happening here. On the right side, we call "check_flip", which is the name of our function. Inside the brackets is the dataframe "raw_prices" which we loaded before; it is the input of our function and corresponds to the "df" variable in the function definition.
We are now going to manipulate our freshly loaded dataset so that we can actually work with it.
When we initially loaded our dataset with the command,
we didn't specify what our index column should be. In the case of financial time series data, we want the date to be the index of our dataframe. Either we set the index with
We can then easily subset our data by certain timeframes. Suppose we want to use the 2nd of February 2012 as the start date of the dataframe:
Using the loc function, we can subset the DataFrame from the new_start date until the last available observation with the following command:
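A minimal sketch of these steps, setting the date as index, defining the start date and subsetting with loc, could look as follows. The data and the names "Date" and df_prices are assumptions for illustration; new_start and df_prices_cut are the names used in the text. As far as we know, read_excel also accepts an index_col argument, which would be an alternative way to set the index directly while loading.

```python
import pandas as pd

# Hypothetical stand-in for the loaded prices; values are invented.
raw_prices = pd.DataFrame({
    "Date": pd.date_range("2012-01-30", periods=6, freq="B"),
    "SPY": [131.4, 131.3, 132.5, 132.7, 134.5, 134.4],
})

# Make the date column the index of the dataframe.
df_prices = raw_prices.set_index("Date")

# Subset from the 2nd of February 2012 until the last available observation.
new_start = pd.Timestamp("2012-02-02")
df_prices_cut = df_prices.loc[new_start:]
print(df_prices_cut)
```

Note that, unlike iloc, slicing with loc works on the index labels (here the dates) and includes the start label itself.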
Let's calculate the returns and print a basic summary.
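One possible way to do this is sketched below, assuming simple (percentage) returns are meant; the text does not say whether simple or log returns are used. The prices are invented, and pct_change and describe are standard pandas methods.

```python
import pandas as pd

# Hypothetical prices indexed by date; values are invented.
df_prices_cut = pd.DataFrame(
    {"SPY": [100.0, 102.0, 101.0, 103.02]},
    index=pd.date_range("2012-02-02", periods=4, freq="B"),
)

# Simple returns: (P_t / P_{t-1}) - 1. The first row has no predecessor and is dropped.
returns = df_prices_cut.pct_change().dropna()

# Count, mean, standard deviation, min, quartiles and max per column.
summary = returns.describe()
print(summary)
```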
We rely on the matplotlib library to plot any type of graph in Python. Find a beginner cheatsheet here. We import what we need with the following statement:
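As a rough sketch, a simple price chart could be produced as shown below. The data, labels and title are assumptions for illustration. The "Agg" backend is selected only so that the snippet also runs on machines without a display; in an IDE you would normally leave that line out and call plt.show() at the end.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line when working in an IDE
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices indexed by date; values are invented.
df_prices_cut = pd.DataFrame(
    {"SPY": [132.7, 134.5, 134.4, 134.8, 135.2]},
    index=pd.date_range("2012-02-02", periods=5, freq="B"),
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(df_prices_cut.index, df_prices_cut["SPY"], label="SPY")  # one line per series
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("SPY price")
ax.legend()
fig.tight_layout()
```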
For-loops are one of the most basic and most important concepts in programming. They are used for iterating over a sequence, which can be a list, a tuple, a dictionary, a set, or a string. Using a for-loop, we can execute a set of statements once for each item in a sequence. We illustrate the use of for-loops by computing Moving Averages (MA). The most basic form of MA is the simple MA (SMA), which is calculated by taking the arithmetic mean of a set of prices over a specified period n.
Let's investigate this in detail. We want to calculate the SMA series of the prices in our dataset df_prices_cut. We choose the number of time periods n to be 120. It is clear that we will lose the first 120 observations, since the first SMA value in the SMA series is the arithmetic average of 120 sequential values. After calculating the first SMA value, we need to repeat the same task moving one row (one day) ahead. We use a for-loop for this:
Congrats! We have employed our first own for-loop. Let's elaborate in detail what this loop is doing.
With
we tell Python to repeat the task for every value i in the sequence of days starting at 120 and ending at the last observation in the dataset df_prices_cut.
By indenting the next line, we tell python that this line contains the task to be repeated. On the LHS of the equation, we instruct our computer to fill the i th row in the dataset with the calculated SMA value on the RHS. To calculate the average over the first 120 days, we first have to specify what exactly the first 120 observations are. We want to start at the first observation (python index = 0) and
For the first iteration of the for-loop, this is equal to:
Attention! This selects the 120 values at integer positions 0 to 119. This is Python-specific: when slicing with a range 0:n, the end position n is excluded, so the last selected element sits at position n-1.
However, the first observation in our SMA dataset is
and represents the 121st observation of df_prices_cut.
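The loop described above can be sketched as follows. To keep the example small enough to check by hand, it uses n = 3 instead of 120 and six invented prices; the names sma and "SPY" are assumptions. As in the text, the SMA stored in row i is the average of the n rows before it (positions i-n to i-1).

```python
import numpy as np
import pandas as pd

n = 3  # window length; the post uses 120, a small value is used here for readability

# Hypothetical prices; values are invented.
df_prices_cut = pd.DataFrame(
    {"SPY": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]},
    index=pd.date_range("2012-02-02", periods=6, freq="B"),
)

# Empty container for the results, aligned with the dates of the prices.
sma = pd.Series(np.nan, index=df_prices_cut.index, name=f"SMA_{n}")

for i in range(n, len(df_prices_cut)):
    # The slice i-n:i selects positions i-n ... i-1; the end position i is excluded.
    sma.iloc[i] = df_prices_cut["SPY"].iloc[i - n:i].mean()

sma = sma.dropna()  # the first n rows have no SMA value
print(sma)
```

For larger datasets, the rolling method of pandas computes the same kind of average far more compactly; the explicit loop is shown here because it makes the indexing visible.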
Dictionaries are used to store data values in key:value pairs. Put very simply, you can think of them as lists where individual elements have names. Why is that even useful, you may ask. Imagine you have a vector of daily weights which you determined by some rule for some stock $ABC of length 252, so by definition it is a 1 dimensional object. Now 2 new stocks, $XYZ and $JKL, come into play with their weights you determined - so you merge, or stack these three vectors to create a matrix, so a two dimensional object. In practice, that would be most likely a dataframe. We can use such dataframe to dot multiply it with a returns dataframe to obtain observed returns.
Let us now say you want to keep these same three stocks, but you came up with many more rules for determining their weights - where do you store them?! You have several options:
So what now? Well, we can first utilise a list. A list is basically a collection of other elements, whatever they are: dataframes, numbers, strings or other lists. It's like a wardrobe where each drawer can store some things - socks, shirts, but also other wardrobes! That way, you can create tree-like structures for storing data.
Back to our example - let's say you have the three stocks in a dataframe. Now you create a hundred rules on how to determine those weights, so you are left with 100 individual dataframes. If one dataframe were a sheet of paper, then putting them all into a list would be like the stack of paper you put into your printer. That is already quite nice; the only problem is accessing the individual dataframes. In the case of our stack of paper, you would have to remember where each sheet is in order to get the one you want.
And that's where dictionaries come into play - just like a book, you not only stack the papers together, but you can name the individual chapters so that you can easily find them! The inherent advantage of this property is that you can also name it dynamically in a for-loop.
Let's dig into one example here:
This dictionary allows us to select different specifications of the same calculation. By typing
where 'MA_240_days' is a key of the dictionary, we get the value of the dictionary, which is a pandas dataframe in this case.
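As a sketch of how such a dictionary could be filled in a for-loop with dynamically generated keys, consider the example below. Only the key 'MA_240_days' comes from the text; the other window lengths, the name ma_dict, the random price data and the use of the pandas rolling mean (whose window includes the current day) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical random-walk prices, long enough for a 240-day window.
rng = np.random.default_rng(0)
df_prices_cut = pd.DataFrame(
    {"SPY": 100 + rng.normal(0, 1, 300).cumsum()},
    index=pd.date_range("2012-02-02", periods=300, freq="B"),
)

ma_dict = {}
for n in [60, 120, 240]:
    # The key is built dynamically from the window length, e.g. "MA_240_days".
    ma_dict[f"MA_{n}_days"] = df_prices_cut.rolling(n).mean().dropna()

# Look up one specification by its key; the value is a pandas dataframe.
ma_240 = ma_dict["MA_240_days"]
print(list(ma_dict.keys()))
print(ma_240.head())
```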
Coding is one of the areas where people like sharing their work and helping others online. This is a valuable trait, as it makes learning and exploring new ideas incomparably easier and more fun. In fact, the amount of open-source code, guides, tutorials etc. published for free on the internet is so vast that you can become an adept coder using only these resources. In the following, you will find a list of resources we personally use daily and find exceptionally useful.
Stackoverflow is an absolutely essential resource no coder works without. It is a forum where one can post their coding-related questions which are answered with tailor-made solutions by other fellow forum members. It is incredibly helpful because:
Sometimes the hardest part of finding the right answer is formulating the question itself; you might be looking for "how to index/slice in Python" without knowing that it is called indexing and slicing. That takes a little practice as well, but the point is - if you haven't already, create a free account and be active! This will earn you points and make your profile more credible, which might lead to quicker answers to your problems.
Github is an outstanding portal used for file sharing and version control not only when working in a team, but also alone. Think of Google Documents, where you can simultaneously work on a file with your colleagues - but for coding. It is of course much deeper than that with all its functionalities, but to summarise, it allows you to:
Many coders share their projects freely on Github as well (as we do with projects published here), so that you may benefit by simply pulling (downloading) them and trying them out. Moreover, it can serve as your project portfolio to show during interviews for your future Quant-Finance job :-)
Lastly, Github also has a blog/troubleshooting functionality, thus, in addition to Stackoverflow, you might sometimes find answers to your problems there.
Medium is an online blog platform with a subsection called "Towards Data Science", where you can find posts from fellow coders, data scientists or enthusiasts. The nice part is that more often than not, you will find whole codes in the articles going through some project on the topic of finance, statistics, machine learning or general data analysis: Example of PCA applied to stock data.
There is a monthly limit though and some articles are premium only, but in our opinion, it is absolutely worth the $5 a month (Disclaimer: we are not affiliated with Medium.com in any way).
O'Reilly is a book publisher with a focus on programming languages. They have a great collection to choose from, and each book contains hundreds of lines of code with applied examples, such as this one specifically written about Python for Finance.
Package documentations are a tremendous, albeit somewhat more complex resource for coding, especially for beginners. In a package documentation, you will find:
Example: let's say you have a pandas dataframe containing only values of 0 and 1. You would like to flip those, i.e. 0's should become 1's and 1's should become 0's. You know that there is a function .replace() in the pandas package, but are unsure how to tell the function what to replace with what.
You head over to the pandas package documentation and find that there are many ways - one of them is supplying a dictionary, like so:
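A minimal sketch of the dictionary approach is shown below. The dataframe and its column names are invented for illustration; the dictionary maps each old value to its new value.

```python
import pandas as pd

# Hypothetical dataframe containing only 0's and 1's.
signals = pd.DataFrame({"ABC": [0, 1, 1], "XYZ": [1, 0, 1]})

# Keys are the values to look for, values are what they are replaced with.
flipped = signals.replace({0: 1, 1: 0})
print(flipped)
```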
And voilà! We have flipped 0's and 1's.