Numpy¶

David E. Bernal Neira
Davidson School of Chemical Engineering, Purdue University

Table of Contents¶

Initializing Numpy Arrays
Array Indexing
Array Operator Behavior
Array Methods
Array functions

Tip: Click any link to jump to that section! Make sure your section headers use Markdown headings (e.g., ## Array Indexing) for anchor links to work in Jupyter. For repeated headings, Jupyter appends -1, -2, etc. to the anchor.

# Detect whether this notebook is running on Google Colab
try:
  import google.colab
  IN_COLAB = True
except ImportError:
  IN_COLAB = False

If you are using Google Colab you should save this notebook and any associated text files to their own folder on your Google Drive. Then you will need to adapt the following commands so that the notebook runs from the location of that folder. This is only necessary for the brief section on reading text files into Python.

# If you want to use Google Drive to save/load files, set this to True
USE_GOOGLE_DRIVE = False
if IN_COLAB and USE_GOOGLE_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')

    # Colab command to navigate to the folder holding the homework,
    # CHANGE FOR YOUR SPECIFIC FOLDER LOCATION IN Google Drive
    # Note: if there are spaces in the path, you need to precede them with a backslash '\'
    %cd /content/drive/My\ Drive/CHE597/Lectures/2-Numpy

Introduction to Numpy¶

Numpy is a ubiquitous module in scientific Python applications that supplies an important class called an array. Arrays are essentially numerical implementations of tensors (i.e., vectors, matrices, and other higher dimensional numerical objects). In this section we will see the basic attributes and methods associated with numpy arrays.

Historical note: SciPy is a very common module that is built on top of numpy arrays and depends on numpy. If you are looking up things online you will often see the two mentioned together, but the basic distinction is that SciPy contains a lot of additional functions and numerical methods that are not part of numpy. There is no point in importing the much larger scipy module if all you are doing is using numpy arrays and their associated functionality.

In the following sections keywords you should know and in-text code will be presented as keywords and code, respectively.

Initializing Numpy Arrays¶

Initializing a numpy array is as simple as importing the library and passing a list (or another sequence, such as a range) to np.array(). Note that np.array() is a function that returns an instance of the np.ndarray class:

import numpy as np
a = np.array(range(9))
print(a)

[0 1 2 3 4 5 6 7 8]

Note that because we’ve imported numpy as np, we access the associated objects in numpy through the np. namespace. np.array() accepts a list containing a single datatype, and has an optional dtype argument for setting the datatype:

b = np.array(range(9),dtype=np.float32)
print(b)

[0. 1. 2. 3. 4. 5. 6. 7. 8.]

When we supply the type np.float32, the integers returned by range(9) are converted to numpy 32-bit floats. (Note: built-in Python floats are stored as 64-bit numbers; you can use np.float64 if you need this additional precision, but it doubles the memory cost of each number.) Arrays only hold data of a single type. If you supply mixed types, numpy upcasts to the most general type required to represent all of the objects:

c = np.array([1,2.0,1+1j])
d = np.array([1,2.0,"3"])
print(c)
print(d)

[1.+0.j 2.+0.j 1.+1.j]
['1' '2.0' '3']

In the first example all of the numbers are recast as complex, in the second they are recast as strings.
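The upcasting rule can be checked directly through the dtype attribute of each array. The following is a minimal sketch; the variable names are illustrative and not from the text above, and because the default integer width depends on the platform, only its kind is printed:

```python
import numpy as np

# Hypothetical example arrays (names are illustrative)
ints = np.array([1, 2, 3])                 # all integers
with_float = np.array([1, 2.0])            # int + float
with_complex = np.array([1, 2.0, 1 + 1j])  # int + float + complex
with_str = np.array([1, 2.0, "3"])         # numbers + string

print(ints.dtype.kind)       # 'i' (signed integer; exact width depends on the platform)
print(with_float.dtype)      # float64
print(with_complex.dtype)    # complex128
print(with_str.dtype.kind)   # 'U' (unicode string)
```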

Arrays can also be made by passing tuples:

d = np.array((1,2,3))
print(d)

[1 2 3]

You won’t get an error if you pass sets and dictionaries to np.array(), but their behavior won’t be array-like. You can also initialize arrays to have multiple dimensions, by initializing them with lists of lists:

e = np.array([1,2]) # 1D array (vector)
f = np.array([[1,2],[3,4]]) # 2D array (matrix)
g = np.array([[[1,2],[3,4]],[[5,6],[7,8]]]) # 3D array(tensor)
print("1D:\n{}".format(e))
print("\n2D:\n{}".format(f))
print("\n3D:\n{}".format(g))

1D:
[1 2]

2D:
[[1 2]
 [3 4]]

3D:
[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
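The number of dimensions and the length along each dimension can be read from the ndim and shape attributes. A short sketch, reusing the e, f, and g arrays from the example above:

```python
import numpy as np

e = np.array([1, 2])
f = np.array([[1, 2], [3, 4]])
g = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

for name, arr in [("e", e), ("f", f), ("g", g)]:
    # ndim is the number of dimensions; shape is the length along each one
    print(name, arr.ndim, arr.shape)
# e 1 (2,)
# f 2 (2, 2)
# g 3 (2, 2, 2)
```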

Numpy also comes with several built-in functions for initializing arrays:

zeros([n_rows,n_cols,...],dtype=float64) can be used to initialize an array of zeroes with the specified shape and type.
ones([n_rows,n_cols,...],dtype=float64) can be used to initialize an array of ones with the specified shape and type.
full([n_rows,n_cols,...],value) can be used to initialize an array of the specified shape with every element set to value.
arange(start=0,stop,step=1,dtype=int64) can be used to initialize a vector of evenly spaced values between start and stop with a spacing of step, including the start value but excluding the stop value. start and step are optional.
linspace(start,stop,num=50) can be used to initialize a vector with a specified number of evenly spaced values between the start and stop values, including the stop value by default. Similar to arange, but you specify the number of values rather than the step.

Their behavior is more clearly illustrated with examples:

h = np.zeros([2,2])
i = np.ones([2,2])
j = np.full([2,2],100.)
k = np.arange(10)
l = np.linspace(1,2,num=11)
print(h)
print(i)
print(j)
print(k)
print(l)

[[0. 0.]
 [0. 0.]]
[[1. 1.]
 [1. 1.]]
[[100. 100.]
 [100. 100.]]
[0 1 2 3 4 5 6 7 8 9]
[1.  1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2. ]

Numpy has even more functions for initializing arrays for particular purposes, but these are the most common.
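As a small illustration of the difference between arange and linspace, the sketch below builds the same interval both ways (the variable names are illustrative). arange takes a step and leaves out the stop value, while linspace takes a count and includes it:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)      # step of 0.25, stop value excluded
counted = np.linspace(0, 1, num=5)   # 5 points, stop value included

print(stepped)   # [0.   0.25 0.5  0.75]
print(counted)   # [0.   0.25 0.5  0.75 1.  ]
```

linspace is usually the safer choice when the number of points matters, since a non-integer step in arange can make the length sensitive to floating-point rounding.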

Lastly, Numpy also has a relatively flexible built-in function called loadtxt(filename,comments='#',delimiter=None,skiprows=0,...) for initializing arrays from suitably delimited files (by default, columns are split on whitespace):

import os
if not os.path.exists('sample.txt'):
    !wget -q https://raw.githubusercontent.com/SECQUOIA/PU_CHE597_DSinChemE/main/2-Numpy/sample.txt -O sample.txt
m = np.loadtxt('sample.txt',comments='!',skiprows=1)
print(m)

[[ 0.    12.71 ]
 [ 1.    14.224]
 [ 2.    20.238]
 [ 3.    18.982]]

genfromtxt() is another numpy function with similar arguments that can also apply simple rules for missing values:

import os
if not os.path.exists('sample2.txt'):
    !wget -q https://raw.githubusercontent.com/SECQUOIA/PU_CHE597_DSinChemE/main/2-Numpy/sample2.txt -O sample2.txt
n = np.genfromtxt('sample2.txt',comments='!',skip_header=1,filling_values=100)
print(n)

[[  0.     12.71 ]
 [  1.     14.224]
 [  2.    100.   ]
 [  3.     18.982]]
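The effect of filling_values can also be tried without downloading a file, since genfromtxt accepts file-like objects. The sketch below uses made-up, comma-delimited contents (an assumption for illustration; this is not the layout of sample2.txt) in which the third data row is missing its second value:

```python
import io
import numpy as np

# Hypothetical file contents; the row starting with 2 has no second value
text = "time,value\n0,12.71\n1,14.224\n2,\n3,18.982\n"

# Without filling_values, the missing entry is read as nan
default = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1)
print(default)

# With filling_values, the missing entry is replaced by the given number
filled = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1,
                       filling_values=100)
print(filled)
```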

Check your understanding¶

What does dtype control when you create an array?
When would you prefer linspace over arange?

# Mini-exercise
# Create a float32 array with values 0 to 4 and print its dtype.
arr = None  # TODO
if arr is not None:
    print(arr.dtype)
# Expected output: float32

Array Indexing¶

You can access values from arrays using the same slicing notation that we use for lists:

m = np.arange(10)
print(m[2::2])

[2 4 6 8]

For multi-dimensional arrays, you can use sequential [] notation to access specific elements and rows. The first index corresponds to rows, the second to columns, and so on. For example array[3][::2] would return every other column from the 4th row. In sequential [] notation, each slice is returned before the next operates. Here are some examples:

n = np.array([[1,2],[3,4]])
print(n)
print(n[1][1])
print(n[0])

[[1 2]
 [3 4]]
4
[1 2]

If you want to return specific columns or other dimensions you would use [row,col,...] notation. This is distinct from the sequential notation in that the slicing operation is carried out in a single step. This is necessary for returning columns. Consider the differing results of the following slicing operations:

o = np.array([[1,2],[3,4]])
print(o)
print(o[:,1]) # Example 1
print(o[:][1]) # Example 2

[[1 2]
 [3 4]]
[2 4]
[3 4]

In the first example the second column is returned, but in the second example the second row is returned. The latter result occurs because the first slicing operation o[:] just returns the full array, with the result that the second slicing operation ([1]) returns the second row.
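The same distinction is easier to see on a non-square array. A minimal sketch, using an illustrative 3x4 array that is not part of the examples above:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
# a is:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(a[1, ::2])     # every other column of the second row -> [4 6]
print(a[1][::2])     # same result via sequential notation   -> [4 6]
print(a[:, 1])       # second column                         -> [1 5 9]
print(a[:][1])       # second row, because a[:] is all of a  -> [4 5 6 7]
print(a[0:2, 1:3])   # rows 0-1 and columns 1-2, selected in one step
```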

Finally, arrays are distinct from lists in that you can pass a list of indices to arrays via the [] notation to return values in that order:

print(o[[1,0]])
print(o[[1,0,1],[1,0,0]])

[[3 4]
 [1 2]]
[4 1 3]

In the first case, the rows are returned in the order specified in the list. In the second case where two lists are passed, the first list specifies the row locations and the second list specifies the column locations (i.e., the o[1][1], o[0][0], and o[1][0] elements). This second case is a useful reference for below where the np.where() function is discussed.

Check your understanding¶

How is a[1][2] different from a[1,2]?
What does : mean in a slice?

# Mini-exercise
# Create a 3x3 array and print the middle row and middle column.
a = np.arange(9).reshape(3,3)
# TODO: set these to the correct slices
middle_row = None
middle_col = None
if middle_row is not None:
    print(middle_row)
if middle_col is not None:
    print(middle_col)
# Expected output: [3 4 5] and [1 4 7] (order may vary)

Array Operator Behavior¶

Numpy arrays are defined to behave with typical operators (+,-,*,**,/,%,>,==) in an element-wise fashion. Thus, a*b does not mean dot product, it means element-wise product between the arrays pointed to by a and b:

p = np.arange(4)
print("p: {}".format(p))
print("p+p: {}".format(p+p))
print("p-p: {}".format(p-p))
print("p*p: {}".format(p*p))
print("p**p: {}".format(p**p))
print("p/p: {}".format(p/p))
print("p%p: {}".format(p%p))
print("p>p: {}".format(p>p))
print("p==p: {}".format(p==p))

p: [0 1 2 3]
p+p: [0 2 4 6]
p-p: [0 0 0 0]
p*p: [0 1 4 9]
p**p: [ 1  1  4 27]
p/p: [nan  1.  1.  1.]
p%p: [0 0 0 0]
p>p: [False False False False]
p==p: [ True  True  True  True]

/tmp/ipython-input-2143240182.py:7: RuntimeWarning: invalid value encountered in divide
  print("p/p: {}".format(p/p))
/tmp/ipython-input-2143240182.py:8: RuntimeWarning: divide by zero encountered in remainder
  print("p%p: {}".format(p%p))
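The warnings in the output arise because the first element of p is 0, so p/p evaluates 0/0 (which gives nan) and p%p takes a remainder by zero. The sketch below shows two possible ways of handling this, assuming you want either to silence the warning or to skip the division where the denominator is zero. np.errstate and the out/where arguments of np.divide are NumPy features that are not covered elsewhere in this section:

```python
import numpy as np

p = np.arange(4)

# Option 1: silence the warnings locally; the first element is still nan
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = p / p
print(ratio)

# Option 2: only divide where the denominator is nonzero, keeping 0 elsewhere
safe = np.divide(p, p, out=np.zeros(p.shape), where=(p != 0))
print(safe)   # [0. 1. 1. 1.]
```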

Since operators act element-wise, they only make sense when a and b point to arrays of the same shape (see broadcasting behavior in numpy docs for more details on when operations can be performed on arrays of different size). You can also apply these operators between arrays and scalars, in which case the same scalar is applied to every element of the array in the operation:

print("p: {}".format(p))
print("p+1.0: {}".format(p+1.0))
print("p-1.0: {}".format(p-1.0))
print("p*2.0: {}".format(p*2.0))
print("p**2.0: {}".format(p**2.0))
print("p/2.0: {}".format(p/2.0))
print("p%2.0: {}".format(p%2.0))
print("p>2: {}".format(p>2))
print("p==2: {}".format(p==2))

p: [0 1 2 3]
p+1.0: [1. 2. 3. 4.]
p-1.0: [-1.  0.  1.  2.]
p*2.0: [0. 2. 4. 6.]
p**2.0: [0. 1. 4. 9.]
p/2.0: [0.  0.5 1.  1.5]
p%2.0: [0. 1. 0. 1.]
p>2: [False False False  True]
p==2: [False False  True False]
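Broadcasting extends the scalar case to arrays of different but compatible shapes. As a rough sketch (the names grid and row are illustrative), a 1D array whose length matches the last dimension of a 2D array is applied to every row, while incompatible shapes raise an error:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# The row is applied to every row of grid, like a scalar applied to every element
print(grid + row)
# [[10 21 32]
#  [13 24 35]]

# Shapes that cannot be matched raise an error instead
try:
    grid + np.array([1, 2])         # shape (2,) does not match the last dimension (3)
except ValueError as err:
    print("ValueError:", err)
```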

Numpy arrays also support the @ operator for matrix multiplication, which for two 1D arrays reduces to the inner product:

print("p@p: {}".format(p@p))

p@p: 14

For 1D and 2D arrays, this is equivalent to the array.dot() method and np.dot() function described below.
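For 2D arrays the difference between * and @ becomes visible in the values. A minimal sketch with an illustrative 2x2 matrix A:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

print(A * A)   # element-wise product
# [[ 1  4]
#  [ 9 16]]
print(A @ A)   # matrix product
# [[ 7 10]
#  [15 22]]
```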

Check your understanding¶

What does * do for arrays?
What does the @ operator compute?

# Mini-exercise
# Compute element-wise product and dot product.
a = np.array([1,2,3])
b = np.array([4,5,6])
elem = None  # TODO
dot = None  # TODO
if elem is not None:
    print(elem)
if dot is not None:
    print(dot)
# Expected output: [ 4 10 18] and 32

Array Methods¶

Numpy arrays come with a number of built-in methods for reshaping matrices, performing linear-algebra operations, calculating statistics, sorting, and performing logical operations.

Reshaping¶

When we initialize an array, we can reorganize the elements using the .reshape(dim1,dim2,...) method to convert a vector into a matrix, etc. and vice versa:

# Example 1: Using reshape to increase dimensions
q = np.arange(9).reshape(3,3)
print(q)

# Example 2: Using reshape to reduce dimensions
print(q.reshape(9))

[[0 1 2]
 [3 4 5]
 [6 7 8]]
[0 1 2 3 4 5 6 7 8]

Example 1: We create a 1-d array with arange(9) and immediately reshape it to a square 3x3 matrix. Note that you can chain numpy methods together just like you can any class method.
Example 2: We reshape the 3x3 matrix back to a 1D vector. Note that numpy counts objects by row, then column, etc. when reducing dimensions.
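The row-by-row ordering also applies when reshaping to other shapes, as long as the total number of elements is preserved. A brief sketch with illustrative names; passing -1 for one dimension, which asks numpy to infer it, is a convenience not mentioned above:

```python
import numpy as np

v = np.arange(6)
wide = v.reshape(2, 3)    # fills row by row: [[0 1 2], [3 4 5]]
tall = v.reshape(3, 2)    # same elements, same order: [[0 1], [2 3], [4 5]]
auto = v.reshape(2, -1)   # -1 asks numpy to infer the remaining dimension (3 here)

print(wide.shape, tall.shape, auto.shape)   # (2, 3) (3, 2) (2, 3)

# The total number of elements must be preserved
try:
    v.reshape(4, 2)
except ValueError as err:
    print("ValueError:", err)
```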

Linear Algebra¶

One of the major motivations for developing numpy was to provide Python users with access to objects that behave like tensors. Numpy arrays thus supply methods for the dot product, transpose, and complex conjugate:

r = np.linspace(1+1J,4+4J,num=4).reshape(2,2)
print(r)
print(r.dot(r))
print(r.dot(r.transpose()))
print(r.dot(r.transpose().conj())) # Hermitian

[[1.+1.j 2.+2.j]
 [3.+3.j 4.+4.j]]
[[0.+14.j 0.+20.j]
 [0.+30.j 0.+44.j]]
[[0.+10.j 0.+22.j]
 [0.+22.j 0.+50.j]]
[[10.+0.j 22.+0.j]
 [22.+0.j 50.+0.j]]

More linear algebra operations can be applied using built in numpy functions (described in the following section) that accept arrays as inputs.

Statistics¶

Arrays come with built-in methods for calculating means, sums, products, and standard deviations along rows/columns:

Note: See statistics routines: https://numpy.org/doc/stable/reference/routines.statistics.html

# Mean Examples
r = np.arange(4).reshape(2,2)
print("r:\n{}".format(r))
print("\nmean examples:")
print(r.mean())
print(r.mean(axis=0))
print(r.mean(axis=1))

# Sum Examples
print("\nsum examples:")
print(r.sum())
print(r.sum(axis=0)) # over rows
print(r.sum(axis=1)) # over columns

# Product Examples
print("\nprod examples:")
print(r.prod())
print(r.prod(axis=0)) # over rows
print(r.prod(axis=1)) # over columns

# Stdev Examples
print("\nstd examples:")
print(r.std())
print(r.std(axis=0)) # over rows
print(r.std(axis=1)) # over columns

r:
[[0 1]
 [2 3]]

mean examples:
1.5
[1. 2.]
[0.5 2.5]

sum examples:
6
[2 4]
[1 5]

prod examples:
0
[0 3]
[0 6]

std examples:
1.118033988749895
[1. 1.]
[0.5 0.5]

Each of these methods accepts an optional axis argument which corresponds to which dimension you want to perform the operation over (See later section on this topic). For example, .mean(axis=0) corresponds to taking the mean of each column (since you are calculating the mean over the rows). Similar to other array related commands, axis interprets 0 as rows, 1 as columns, etc. When axis is not specified, each method is calculated with respect to all elements in the array.
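A non-square array makes it easier to see which dimension is collapsed, because the shape of the result reveals it. A small sketch using an illustrative 3x4 array:

```python
import numpy as np

w = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

col_sums = w.sum(axis=0)   # collapses the 3 rows -> one value per column
row_sums = w.sum(axis=1)   # collapses the 4 columns -> one value per row

print(col_sums, col_sums.shape)   # [12 15 18 21] (4,)
print(row_sums, row_sums.shape)   # [ 6 22 38] (3,)
```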

Sorting¶

Arrays come with built-in methods for sorting values and returning the indices of the sorted array:

# Initialize 2D array
s = np.array([[4,1],[2,3]])
print("s:\n{}\n".format(s))

# Example 1: Sort rows
s = np.array([[4,1],[2,3]])
s.sort() # Same as axis=1
print("\ns.sort():\n{}".format(s))

# Example 2: Sort columns
s = np.array([[4,1],[2,3]])
s.sort(axis=0)
print("\ns.sort(axis=0):\n{}".format(s))

s:
[[4 1]
 [2 3]]


s.sort():
[[1 4]
 [2 3]]

s.sort(axis=0):
[[2 1]
 [4 3]]

Example 1: The .sort() method modifies the existing array in place and does not return a new array. It accepts an optional axis argument, which defaults to -1 (i.e., the last dimension of the array). For this 2D array that is axis=1, so the sorting is across the columns and each row has the same values but in ascending order.

Example 2: Here axis=0, so the sorting is across the rows and each column has the same values but in ascending order.
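If the original array needs to be kept, the np.sort() function offers an alternative to the in-place method: it returns a sorted copy and leaves its input untouched. A minimal sketch contrasting the two:

```python
import numpy as np

s = np.array([[4, 1], [2, 3]])

sorted_copy = np.sort(s, axis=1)   # returns a new, sorted array
print(sorted_copy)
# [[1 4]
#  [2 3]]
print(s)                           # the original is unchanged
# [[4 1]
#  [2 3]]

result = s.sort()                  # sorts s in place and returns None
print(result)                      # None
```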

Arrays also come with the .argsort() method that returns the indices that would sort the array, this can be useful when you want to sort an array by the values in a particular column or row:

t = np.array([[3,0.1],[1,0.2],[2,0.2]])
inds = t[:,0].argsort()
print("t:\n{}\n".format(t))
print("inds:\n{}\n".format(inds))
print("t[inds]:\n{}".format(t[inds]))

t:
[[3.  0.1]
 [1.  0.2]
 [2.  0.2]]

inds:
[1 2 0]

t[inds]:
[[1.  0.2]
 [2.  0.2]
 [3.  0.1]]

In this example, suppose that the first column in the array were time and we wanted to sort the rest of the values in each row so that the times are ordered. Here t[:,0].argsort() returns an array of indices (inds) that sorts the times. The array can then be indexed with these indices (t[inds]) to return the array sorted by the time values.
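The same indices can be reused for other orderings. As a sketch, reversing the index array gives the rows in descending order of the first column (the name desc is illustrative):

```python
import numpy as np

t = np.array([[3, 0.1], [1, 0.2], [2, 0.2]])

inds = t[:, 0].argsort()   # ascending order of the first column -> [1 2 0]
desc = inds[::-1]          # reversed, i.e. descending order     -> [0 2 1]

print(t[desc])
# [[3.  0.1]
#  [2.  0.2]
#  [1.  0.2]]
```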

Logical Methods¶

When you want to use logical operators on arrays like > and == you might want to know if “any” value in the array is “greater than” or “equal to” a value, respectively. Alternatively, you might want to know if “all” values in the array satisfy the comparison. The .all() and .any() methods are used for this purpose:

u = np.array([3,3,4,5])
print((u>3).any())
print((u>3).all())

True
False

Example 1: (u>3) returns an array of booleans which is passed to .any() for evaluation. .any() returns True because 4 and 5 are greater than 3.
Example 2: (u>3) returns an array of booleans which is passed to .all() for evaluation. .all() returns False because the first two elements are False (i.e., not “all” of the booleans are True).

Check your understanding¶

What happens to the element order when you reshape?
What does axis=0 mean for a 2D array?

# Mini-exercise
# Reshape a vector and compute column means.
a = np.arange(6)
m = None  # TODO
col_means = None  # TODO
if m is not None:
    print(m)
if col_means is not None:
    print(col_means)
# Expected output: a 2x3 array and its 3 column means

Array functions¶

Even though numpy is much lighter-weight than scipy, it comes packed with a lot of useful functions for finding specific elements in arrays and performing additional linear algebra operations.

np.where()¶

The np.where(condition) function is a flexible and optimized function for finding the indices in an array where a condition is True:

v = np.array([[3,1],[2,20]])
row_inds,col_inds = np.where(v>1)
print(row_inds,col_inds) 
print(v[np.where(v>1)]) # np.where() result can also be used directly

[0 1 1] [0 0 1]
[ 3  2 20]

np.where(v>1) returns a tuple of two arrays with the row and column positions corresponding to where v>1. The tuple contains two arrays because v is two dimensional. For example, on a vector np.where only returns one array:

w = np.array([1,2,3,4])
print(np.where(w == 1))
print(w[np.where(w == 1)])

(array([0]),)
[1]
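When only the matching values are needed, and not their positions, the boolean array produced by the condition can be used as an index directly. A short sketch showing that this selects the same values as the np.where() form (the name mask is illustrative):

```python
import numpy as np

v = np.array([[3, 1], [2, 20]])

mask = v > 1                 # boolean array with the same shape as v
print(mask)
# [[ True False]
#  [ True  True]]

print(v[mask])               # [ 3  2 20]
print(v[np.where(mask)])     # [ 3  2 20]
```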

np.where(condition,truevals,falsevals) is a second use case of np.where() that is used to generate a new array based on values from truevals and falsevals. Specifically, the new array takes its value from truevals wherever condition is True and from falsevals wherever it is False. truevals and falsevals must either be constants or arrays of the same dimension as condition (more generally, anything that broadcasts to the shape of condition). Here are examples of both uses:

truevals = np.array([[1,2],[3,4]])
falsevals = np.array([[-1,-2],[-3,-4]])
a = np.where(truevals>1,truevals,falsevals)
b = np.where(truevals<3,truevals,100)
print(a)
print(b)

[[-1  2]
 [ 3  4]]
[[  1   2]
 [100 100]]

Linear Algebra¶

Some array methods can be used to perform linear algebra operations, but a more general set is available through the numpy functions. cross(), dot(), matmul(), outer() all correspond to the expected matrix operations:

# cross, dot, and outer product examples
x = np.array([1,0,0])
y = np.array([0,1,0])
print(np.cross(x,y))
print(np.dot(x,y))
# NOTE: matmul() behaves the same as dot() for vectors and matrices. For 3D arrays it
# treats the array as a stack of matrices stored in the last two dimensions.
print(np.matmul(x,y))
print(np.outer(x,y)) # column vector times row vector: x[:,None] * y[None,:]

[0 0 1]
0
0
[[0 1 0]
 [0 0 0]
 [0 0 0]]
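The unit vectors above hide some of the structure of these operations, so here is a supplementary sketch with illustrative inputs: a matrix-vector product computed with both dot() and matmul(), and an outer product checked against its element-wise definition:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])

print(np.dot(A, v))      # matrix-vector product -> [3 7]
print(np.matmul(A, v))   # same result for 2D and 1D inputs -> [3 7]

x = np.array([1, 2, 3])
y = np.array([4, 5])
outer = np.outer(x, y)                   # shape (3, 2); entry [i, j] is x[i] * y[j]
by_hand = x[:, None] * y[None, :]        # column vector times row vector via broadcasting
print(np.array_equal(outer, by_hand))    # True
```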

Axis argument¶

In typical use cases there can be the need to only apply numpy methods/functions along a subset of an array. For example, you might want to find the maximum value in each row of an array, or each column of an array, rather than over the whole array. Likewise, we already saw the example above of calculating sums and means across columns and rows.

Whenever it makes sense for a method or function to operate across rows or columns, numpy uses an optional axis argument to control the behavior. In most cases the default behavior is to perform the operation over the whole array, whereas axis=0 means to perform the operation across rows, and axis=1 means to perform the operation across columns. You will only get an intuition for this from practice, but below are some illustrative examples:

a = np.arange(25).reshape(5,5)
print(a)
print("sum over array: {}".format(a.sum()))
print("sum over rows: {}".format(a.sum(axis=0)))
print("sum over columns: {}".format(a.sum(axis=1)))
print("maximum value over the array: {}".format(a.max()))
print("maximum value in each column: {}".format(a.max(axis=0)))
print("maximum value in each row: {}".format(a.max(axis=1)))

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]
sum over array: 300
sum over rows: [50 55 60 65 70]
sum over columns: [ 10  35  60  85 110]
maximum value over the array: 24
maximum value in each column: [20 21 22 23 24]
maximum value in each row: [ 4  9 14 19 24]
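For arrays with more than two dimensions, one way to build intuition is to check the shape of the result: the dimension named by axis is the one that disappears. A minimal sketch with an illustrative 3D array:

```python
import numpy as np

a3 = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

print(a3.sum(axis=0).shape)    # (3, 4): the first dimension is summed away
print(a3.sum(axis=1).shape)    # (2, 4): the second dimension is summed away
print(a3.sum(axis=2).shape)    # (2, 3): the third dimension is summed away
print(a3.sum(axis=-1).shape)   # (2, 3): -1 refers to the last dimension
```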

Check your understanding¶

What does np.where return for a 2D array?
How does the axis argument change a reduction?

# Mini-exercise
# Replace negative values with 0 using np.where.
a = np.array([-2, -1, 0, 1, 2])
b = None  # TODO
if b is not None:
    print(b)
# Expected output: [0 0 0 1 2]

References¶

NumPy User Guide: https://numpy.org/doc/stable/user/index.html
Array creation: https://numpy.org/doc/stable/reference/routines.array-creation.html
Indexing: https://numpy.org/doc/stable/user/basics.indexing.html
Broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html
I/O: https://numpy.org/doc/stable/reference/routines.io.html
Statistics: https://numpy.org/doc/stable/reference/routines.statistics.html