Python NumPy Array Tutorial
Today we will be talking about a useful library that is widely used in data science. Why do we need a library instead of plain Python? Well, we can use Python itself, but once you get familiar with working in NumPy, you will see the difference.
1. What is NumPy?
NumPy is a core library and a foundation for data science in Python. It works great with multidimensional array objects. The library’s name is short for “Numerical Python”. Basically, we use it to solve mathematical models of problems in science and engineering on a computer. It’s far more efficient to use this library than to write your own routines in Python, because it already has the features you need. So let’s get started, shall we?
1.1 Installing NumPy
You need to install NumPy before working with it, because it does not ship with Python. By now, you should have at least Python installed on your computer. We will use the most common way of installing, which is pip.
If you are a Windows user, try this:
Windows installation
pip3 install numpy
Note that it may not work. If so, there are 2 things to check: either add Python to the PATH environment variable, or check whether pip is installed and up to date. To upgrade pip and verify it, try this:
Windows installation
pip install pip --upgrade
pip --version
If that still fails, you may need to download a NumPy wheel from the Internet and install it. Then you will be ready to go. Now, if you are a Linux user, this one will help: open a terminal with Ctrl+Alt+T or any other combination your distribution uses. Then type:
Linux installation
sudo pip3 install numpy
That’s it! Now you are ready to code!
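To confirm that the installation worked, a quick check along these lines can help. This is only a minimal sketch: the version printed depends on what pip installed on your machine, and the names version and check are used here purely for illustration.

```python
import numpy as np

# Print the installed NumPy version; the exact value depends on your setup
version = np.__version__
print(version)

# Build a tiny array to make sure the library actually works
check = np.array([1, 2, 3])
print(check.sum())  # 6
```

If the import fails with ModuleNotFoundError, NumPy was most likely installed for a different Python interpreter than the one you are running.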
2. Working with arrays
2.1. Arrays manipulation
We already know what an array is. It’s a grid of values of the same type. An array represents any regular data in a structured way. For example, you may have an array of strings, integers, booleans, etc. Let’s create one:
Python Shell
>>> my_array = [1,2,3]
>>> print(my_array)
[1, 2, 3]
>>> my_2d_array = [4,[5,6],7]
>>> print(my_2d_array)
[4, [5, 6], 7]
>>> my_3d_array = [8,[9,[10,11],12],13]
>>> print(my_3d_array)
[8, [9, [10, 11], 12], 13]
As we can see, we have a couple of plain Python lists holding integers, nested to different depths. Strictly speaking, these are nested lists rather than true multidimensional arrays. If we dive into Computer Science, an array contains more than just elements. It also contains information about the raw data, where and how to locate elements in the array and how to interpret them. Let’s talk about 4 basic aspects of manipulating arrays:
- data is a pointer to the memory address of the first byte in the array. Every access to the array starts from this address
- dtype describes the type of the elements stored in the array
- shape indicates the size of the array along each dimension. An array can be multidimensional, so it’s wise to know its shape
- strides is the trickiest one. It tells how many bytes must be skipped in memory to move to the next element along each dimension. It’s more of a pointer logic: elements are laid out one after another in memory, so reaching the next row or column means jumping over a fixed number of bytes. It’s really important to understand it well
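The four attributes above can be inspected on a small NumPy array. The following is a minimal sketch, assuming an array of 8-byte integers stored row by row (the default layout); the array a is made up for illustration.

```python
import numpy as np

# A 3x4 array of 8-byte integers, stored row by row
a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.shape)    # (3, 4)
print(a.dtype)    # int64
print(a.strides)  # (32, 8): 32 bytes to the next row, 8 bytes to the next column

# Transposing swaps the strides instead of moving any data
print(a.T.strides)  # (8, 32)
```

Each row holds 4 elements of 8 bytes, which is where the 32-byte row stride comes from.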
2.2. NumPy arrays
In order to create a NumPy array we need to use the np.array() function. It’s common practice to import numpy as np. That’s what we will do in this article as well. Let’s create an array and inspect it:
array.py
import numpy as np

a = np.array([1,2,3,4,5])
print(type(a)) #outputs <class 'numpy.ndarray'>
print(a.shape) #outputs (5,)
print(a.dtype) #outputs int64 (on most 64-bit platforms)
print(a.strides) #outputs (8,) for 8-byte integers
print(a.data) #check the output for yourself!
So it’s all pretty simple when it comes to initializing arrays and inspecting them. But what if we want to create arrays filled with particular values?
numpy_array.py
import numpy as np

#create an array of all zeros
a = np.zeros((3,3))
print(a)
'''
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
'''

#create an array of all ones
b = np.ones((2,3))
print(b)
'''
[[1. 1. 1.]
 [1. 1. 1.]]
'''

#create a constant array with a custom shape and custom value
c = np.full((4,3),2)
print(c)
'''
[[2 2 2]
 [2 2 2]
 [2 2 2]
 [2 2 2]]
'''

#create an identity matrix with a custom shape
d = np.eye(4)
print(d)
'''
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
'''

#create random array
e = np.random.random((3,3))
print(e)
#it might print something similar to somewhat below:
'''
[[0.59678947 0.89766843 0.04795142]
 [0.1575911  0.54953419 0.21916215]
 [0.69233153 0.99744842 0.89032515]]
'''
2.3. Indexing
We can manipulate arrays in many ways. One of them is called slicing. Let’s take a slice of an array and see what happens to the original when we modify only the sliced sub-array.
indexing.py
import numpy as np

#create an array
a = np.array([[1,2,3,4,5], [6,7,8,9,10],[11,12,13,14,15]])
print(a)
'''
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
'''

#Now let's obtain sub-array consisting of 3x3 array in the middle of the given array
b = a[:3, 1:4]
print(b)
'''
[[ 2  3  4]
 [ 7  8  9]
 [12 13 14]]
'''

#as we modify b array, we actually modify the given a array. Let's make all elements there equal zeros
for i in range(3):
    for j in range(3):
        b[i,j] = 0
print(b)
'''
[[0 0 0]
 [0 0 0]
 [0 0 0]]
'''

#now let's see how our first array changed:
print(a)
'''
[[ 1  0  0  0  5]
 [ 6  0  0  0 10]
 [11  0  0  0 15]]
'''
Now you may be wondering why the given array changed. If so, it’s a good question! A slice of an array is a view into the same data, so modifying it changes the original array. Makes sense, right?
There is another thing which can be quite confusing. You may extract what looks like the same data in two ways, yet NumPy will report different shapes. Let me show you the code first, and then I will explain what is happening there:
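When the original array must stay untouched, one option is to copy the slice explicitly. The sketch below uses the copy() method and np.shares_memory() to show the difference; the names view and independent are chosen here only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

view = a[:2, :2]                # basic slicing returns a view
independent = a[:2, :2].copy()  # copy() allocates its own data

print(np.shares_memory(a, view))         # True
print(np.shares_memory(a, independent))  # False

independent[:] = 0  # only the copy changes
print(a[0, 0])      # 1

view[:] = 0         # this writes through to a
print(a[0, 0])      # 0
```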
row_col.py
import numpy as np

#create a 3x3 array
a = np.array([[1,2,3], [4,5,6], [7,8,9]])
'''
[[1 2 3]
 [4 5 6]
 [7 8 9]]
'''

#example with rows
row_r1 = a[1, :]
row_r2 = a[1:2, :]
print(row_r1, row_r1.shape) # [4 5 6] (3,)
print(row_r2, row_r2.shape) # [[4 5 6]] (1, 3)

#example with columns
col_r1 = a[:, 1]
col_r2 = a[:, 1:2]
print(col_r1, col_r1.shape) #[2 5 8] (3,)
print(col_r2, col_r2.shape)
'''
[[2]
 [5]
 [8]] (3, 1)
'''
Alright, time for the explanation! Mixing integer indexing with slices yields an array of lower rank, while using only slices yields an array of the same rank as the original array. The same distinction applies when accessing columns.
Anyway, that’s quite a useful trick. We can also select and mutate elements by indexing an array with arrays of integers.
arrange.py
import numpy as np

#let's create 4x4 array
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
print(a)
'''
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]
'''

#let's create an array of indexes
b = np.array([0,1,2,3])
print(b) #[0 1 2 3]

#for each row we extract the element of b's elements indexes
print(a[np.arange(4), b]) #[ 1  6 11 16]
'''
So we basically take the first row and look for 0th element (b[0] = 0) and it's 1 (a[0][0] = 1).
Then we do the same for the rest of rows: (a[1][1] = 6, a[2][2] = 11, a[3][3] = 16)
'''

#let's mutate some elements of the array:
a[np.arange(4), b] = 0
print(a)
'''
[[ 0  2  3  4]
 [ 5  0  7  8]
 [ 9 10  0 12]
 [13 14 15  0]]
'''
Another cool feature is boolean indexing. It lets us pick out arbitrary elements of an array, namely those that satisfy some condition.
bool.py
import numpy as np

a = np.array([[1,2,3], [4,5,6], [7,8,9]])

#let's find all elements that are greater than 5
bool_idx = (a>5)

#this prints a boolean mask with the same shape as a
print(bool_idx)
'''
[[False False False]
 [False False  True]
 [ True  True  True]]
'''

#what if we want the elements themselves? No problem!
print(a[bool_idx]) #[6 7 8 9]

#we can actually do it altogether
print(a[a>5]) #[6 7 8 9]
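Boolean masks can also be combined and used for assignment. The following is a small sketch, not part of the original example set; the name mask is chosen here for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Combine conditions with & (and) or | (or); each side needs parentheses
mask = (a > 2) & (a % 2 == 0)
print(a[mask])  # [4 6 8]

# Assigning through a mask changes only the selected elements
a[mask] = 0
print(a)
'''
[[1 2 3]
 [0 5 0]
 [7 0 9]]
'''
```

Python’s and / or keywords do not work elementwise on arrays, which is why the & and | operators are used here.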
2.4. Datatypes
Do you know why NumPy is great? It tries to guess a datatype when you create an array, but functions that construct arrays also include (usually) an optional argument to explicitly specify the datatype.
datatype.py
import numpy as np

a = np.array([1,2])
print(a.dtype) #int64

a = np.array([1.1, 2.2])
print(a.dtype) #float64

a = np.array([1,2], dtype = np.int64)
print(a.dtype) #int64
So note that we can explicitly specify what kind of data we want to store in our arrays. Here is a short list of the one-character type codes you can use:
- "?" is a boolean
- "b" is a signed byte
- "B" is an unsigned byte
- "i" is a signed integer
- "u" is an unsigned integer
- "f" is a floating-point
- "c" is a complex-floating point
- "m" is a timedelta
- "M" is a datetime
- "O" is an object
- "U" is a unicode string
- "V" is raw data known as void
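As a rough illustration of how these codes are used, a code can be followed by a size in bytes and passed wherever a dtype is expected. This is a minimal sketch; the array name small is made up for the example.

```python
import numpy as np

# A type code followed by a size in bytes names a concrete dtype
print(np.dtype("i4"))  # int32
print(np.dtype("f8"))  # float64
print(np.dtype("u1"))  # uint8
print(np.dtype("?"))   # bool

# The same strings work anywhere a dtype argument is accepted
small = np.array([1, 2, 3], dtype="u1")
print(small.dtype, small.itemsize)  # uint8 1
```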
2.5. NumPy Math
You can perform mathematical functions on arrays. Let’s see how they work!
math.py
import numpy as np

a = np.array([[1,2],[3,4]], dtype=np.float64)
b = np.array([[5,6],[7,8]], dtype=np.float64)

print(a+b)
'''
[[ 6.  8.]
 [10. 12.]]
'''
print(np.add(a,b))
'''
[[ 6.  8.]
 [10. 12.]]
'''
print(a-b)
'''
[[-4. -4.]
 [-4. -4.]]
'''
print(np.subtract(a,b))
'''
[[-4. -4.]
 [-4. -4.]]
'''
print(a*b)
'''
[[ 5. 12.]
 [21. 32.]]
'''
print(np.multiply(a,b))
'''
[[ 5. 12.]
 [21. 32.]]
'''
print(a/b)
'''
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
'''
print(np.divide(a,b))
'''
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
'''
print(a**b)
'''
[[1.00000000e+00 6.40000000e+01]
 [2.18700000e+03 6.55360000e+04]]
'''
print(np.sqrt(a) + np.sqrt(b))
'''
[[3.23606798 3.86370331]
 [4.37780212 4.82842712]]
'''
Note that all of these operations work elementwise, including *. We can also work with vectors in NumPy. To compute real vector and matrix products, we can use the dot function: it computes inner products of vectors, multiplies matrices and multiplies a matrix by a vector. Let’s see how it works!
product.py
import numpy as np

x = np.array([[1,2], [3,4]])
y = np.array([[5,6], [7,8]])
v = np.array([9,10])
w = np.array([11, 12])

#Vector/Vector product
print(v.dot(w)) #219
print(np.dot(v,w)) #219

#Matrix/Vector product
print(x.dot(v)) #[29 67]
print(np.dot(x, v)) #[29 67]

#Matrix/Matrix product
print(x.dot(y))
print(np.dot(x,y))
'''
[[19 22]
 [43 50]]
'''
Another commonly used function is sum. Let’s see how it works!
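In Python 3.5 and later, the @ operator offers a shorter spelling of the same products. The sketch below uses arrays with the same values as the example above; the names x, y, v and w are redefined here so the snippet runs on its own.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])
w = np.array([11, 12])

print(v @ w)  # 219
print(x @ v)  # [29 67]
print(x @ y)
'''
[[19 22]
 [43 50]]
'''
```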
sum.py
import numpy as np

a = np.array([[1,2,3,4,5], [6,7,8,9,10], [11,12,13,14,15], [16,17,18,19,20]])

print(np.sum(a)) #210
print(np.sum(a, axis = 0)) #[34 38 42 46 50]
#it's a sum of each column above!
print(np.sum(a, axis = 1)) #[15 40 65 90]
#it's a sum of each row above!
Apart from computing and manipulating matrices, we can also transpose them. We just need to use the T attribute. Let’s see how it works!
transpose.py
import numpy as np

a = np.array([[1,2,3,4,5], [6,7,8,9,10], [11,12,13,14,15], [16,17,18,19,20]])
print(a)
'''
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]]
'''
print(a.T)
'''
[[ 1  6 11 16]
 [ 2  7 12 17]
 [ 3  8 13 18]
 [ 4  9 14 19]
 [ 5 10 15 20]]
'''
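One detail that often surprises newcomers is that transposing a one-dimensional array does nothing. A small sketch, with the names v and col chosen here for illustration:

```python
import numpy as np

v = np.array([1, 2, 3])

# Transposing a rank-1 array leaves its shape unchanged
print(v.T.shape)  # (3,)

# Reshape to get an explicit column vector, which does transpose
col = v.reshape(-1, 1)
print(col.shape)    # (3, 1)
print(col.T.shape)  # (1, 3)
```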
3. Broadcasting
Broadcasting is a powerful mechanism that allows you to work with arrays of different shapes when performing arithmetic operations. Suppose we have a matrix and want to add a constant vector to each row of it. We can do something like this (even though there are many other ways to do it):
broadcasting.py
import numpy as np

x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])

#create an empty matrix with the same size as x
y = np.empty_like(x)
for i in range(4):
    y[i, :] = x[i, :] + v
print(y)
'''
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]
'''
Adding v to each row of x is equivalent to stacking multiple copies of v vertically and then performing an elementwise summation. We can do it that way like this:
stack.py
import numpy as np

x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])

vv = np.tile(v, (4, 1))
print(vv)
'''
[[1 0 1]
 [1 0 1]
 [1 0 1]
 [1 0 1]]
'''
y = x + vv
print(y)
'''
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]
'''
So what does broadcasting have to do with all this? Well, it actually helps you to perform the computation without creating multiple copies of v. What about this code?
no_copies.py
import numpy as np

x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
v = np.array([1, 0, 1])

y = x + v
'''
This line works even though x has shape (4,3) and v has shape (3,) due to broadcasting
'''
print(y)
'''
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]
'''
There are a couple of things to remember when it comes to broadcasting 2 arrays together:
- The arrays can be broadcast together if they are compatible in all dimensions
- After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays
- Arrays are compatible in a dimension if (and only if) they have the same size in the dimension, or if one of the arrays has size 1 in that dimension
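The rules above can be checked on a few shapes. The following is a rough sketch with made-up arrays; the flag failed exists only to record that the incompatible case raised an error.

```python
import numpy as np

x = np.ones((4, 3))
v = np.array([1, 0, 1])               # shape (3,), treated as (1, 3)
col = np.array([[1], [2], [3], [4]])  # shape (4, 1)

print((x + v).shape)    # (4, 3)
print((x + col).shape)  # (4, 3)
print((v + col).shape)  # (4, 3)

# Shapes (4, 3) and (4,) are incompatible: the trailing sizes 3 and 4 differ
try:
    x + np.array([1, 2, 3, 4])
    failed = False
except ValueError as err:
    failed = True
    print("broadcast error:", err)
```

Shapes are compared from the last dimension backwards, which is why a vector of length 3 matches the rows of a (4, 3) matrix while a vector of length 4 does not.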
4. Summary
By now, we have a basic understanding of NumPy and how to work with NumPy arrays. Let’s sum up what we have learned, with pointers to the relevant parts of the official documentation:
- You can create arrays and manipulate them, mutate elements inside and do many cool things! More info can be found in the array creation and array manipulation sections of the NumPy documentation
- You can index objects in arrays in many different ways: basic slicing, field access, advanced indexing. More about this is in the indexing section of the documentation
- You can perform all sorts of mathematical miracles in NumPy. Actually, if you are using it and want to become a data scientist, maybe there are no miracles but calculations for you! Anyway, more info is in the mathematical functions section of the documentation
- You can do broadcasting, which is very powerful and often makes your code shorter and faster than explicit Python loops. More info is in the broadcasting section of the documentation
- For any general reference it’s always wise to go to the official NumPy docs to solve your problems
5. Homework
It’s always good to practice what you have just learned. So I encourage you to solve these exercises.
- How to randomly place p elements in a 2D array?
- Find the nearest value from a given value in an array
- Given a four-dimensional array, how do you get the sum over the last two axes at once?
- How to swap two rows of an array?
- Compute a matrix rank
6. Download the Source Code
You can find all materials needed in the file below.
You can download the full source code of this example here: python-numpy-array.zip