Intermediate Python

Lesson 6 of 33

Working with Data Structures in NumPy and Pandas

NumPy and Pandas are two powerful libraries for working with data in Python. NumPy is a library for working with large, multi-dimensional arrays and matrices of numerical data, while Pandas is a library for working with tabular data, such as data stored in a spreadsheet or a database table. In this article, we’ll explore the key features and differences of NumPy and Pandas, and how to use them effectively in your Python programs.

NumPy

First, let’s take a look at NumPy. Its core data structure is the ndarray, an N-dimensional array of numerical data, and the library provides functions to create, manipulate, and perform mathematical operations on these arrays.

To use NumPy, you’ll need to install it and import it into your Python program. You can install NumPy using pip, the Python package manager. For example:

pip install numpy

Once you’ve installed NumPy, you can import it into your Python program using the import statement. For example:

import numpy as np

NumPy arrays are similar to Python lists, but they have some important differences. An array stores elements of a single data type in a compact block of memory, which generally makes numerical work faster than looping over a list, and it lets you apply a mathematical operation to the entire array at once rather than to one element at a time. You can create a NumPy array from a Python list by using the np.array() function. For example:

import numpy as np

# Create a NumPy array from a Python list
a = np.array([1, 2, 3])

print(a)  # Output: [1 2 3]
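
As a minimal sketch of what "operations on the entire array" means in practice, the snippet below applies arithmetic to a small array and compares it with the list comprehension a plain Python list would need. The variable names are chosen for illustration only.

```python
import numpy as np

a = np.array([1, 2, 3])

# Arithmetic is applied element by element, with no explicit loop
doubled = a * 2
squared = a ** 2
summed = a + a

# The equivalent with a plain Python list needs a loop or comprehension
doubled_list = [x * 2 for x in [1, 2, 3]]

print(doubled)       # Output: [2 4 6]
print(squared)       # Output: [1 4 9]
print(summed)        # Output: [2 4 6]
print(doubled_list)  # Output: [2, 4, 6]
```

Note that multiplying a list by 2 (for example [1, 2, 3] * 2) repeats the list instead of doubling its elements, which is one reason arrays are preferred for numerical work.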

You can access the elements of a NumPy array using the indexing and slicing operators, just like with a Python list. For example:

import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Access the first element
print(a[0])  # Output: 1

# Access the last element
print(a[-1])  # Output: 5

# Access a slice of elements
print(a[1:3])  # Output: [2 3]
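
One difference from lists is worth keeping in mind: a basic slice of a NumPy array is a view onto the same data, while a slice of a list is a new list. The following sketch illustrates this with the same sample values as above; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
lst = [1, 2, 3, 4, 5]

arr_slice = a[1:3]    # a view that shares memory with a
list_slice = lst[1:3] # an independent new list

# Modify the first element of each slice
arr_slice[0] = 99
list_slice[0] = 99

print(a)    # Output: [ 1 99  3  4  5]  (the original array changed)
print(lst)  # Output: [1, 2, 3, 4, 5]   (the original list did not)

# Use copy() when an independent array is needed
independent = a[1:3].copy()
independent[0] = 0
print(a[1])  # Output: 99
```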

NumPy arrays also have some useful attributes and methods for manipulating and performing mathematical operations on the data. For example, you can use the shape attribute (not a method, so it takes no parentheses) to get the dimensions of the array, the reshape() method to change the shape of the array, and the transpose() method to transpose the array. You can also use methods like mean(), sum(), and std() to compute statistical properties of the data in the array. For example:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# Get the dimensions of the array
print(a.shape)  # Output: (2, 3)

# Reshape the array
b = a.reshape(3, 2)

print(b)

# Output:
# [[1 2]
#  [3 4]
#  [5 6]]

# Transpose the array
c = b.transpose()

print(c)

# Output:
# [[1 3 5]
#  [2 4 6]]

# Compute the mean of the array
mean = a.mean()

print(mean)  # Output: 3.5

# Compute the sum of the array
# (named total to avoid shadowing Python's built-in sum function)
total = a.sum()

print(total)  # Output: 21

# Compute the standard deviation of the array
std = a.std()

print(std)  # Output: 1.707825127659933
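
The methods above reduce the whole array to a single number. As a brief sketch, the same methods also accept an axis argument to compute a result per column or per row of a 2-dimensional array; the variable names here are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one value per column
col_sums = a.sum(axis=0)

# axis=1 collapses the columns, giving one value per row
row_sums = a.sum(axis=1)
row_means = a.mean(axis=1)

print(col_sums)   # Output: [5 7 9]
print(row_sums)   # Output: [ 6 15]
print(row_means)  # Output: [2. 5.]
```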

Pandas

Now let’s take a look at Pandas. Pandas is a library that provides support for working with tabular data, such as data stored in a spreadsheet or a database table. Pandas is particularly useful for working with data that has multiple columns and rows, and for performing complex operations on the data.

To use Pandas, you’ll need to install it and import it into your Python program. You can install Pandas using pip, the Python package manager. For example:

pip install pandas

Once you’ve installed Pandas, you can import it into your Python program using the import statement. For example:

import pandas as pd

Pandas provides two main data structures for working with tabular data: the DataFrame and the Series. A DataFrame is a 2-dimensional table of data with labeled rows and columns, similar to a spreadsheet. A Series is a 1-dimensional labeled array, comparable to a single column of such a table. You can create a DataFrame from a Python dictionary by using the pd.DataFrame() function, or a Series from a Python list by using the pd.Series() function. For example:

import pandas as pd

# Create a DataFrame from a dictionary
data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = pd.DataFrame(data)

print(df)

# Output:
#     name  age
# 0   John   30
# 1   Jane   25
# 2  James   35

# Create a Series from a list
s = pd.Series([1, 2, 3])

print(s)

# Output:
# 0    1
# 1    2
# 2    3
# dtype: int64
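
To make the relationship between the two structures more concrete, here is a small sketch showing that a Series can carry its own labels and that a single DataFrame column is itself a Series. The labels used for the index are hypothetical sample values.

```python
import pandas as pd

# A Series with custom labels instead of the default 0, 1, 2
ages = pd.Series([30, 25, 35], index=['John', 'Jane', 'James'])

print(ages['Jane'])   # Output: 25 (looked up by label)
print(ages.iloc[0])   # Output: 30 (looked up by position)

# Selecting one column of a DataFrame returns a Series
df = pd.DataFrame({'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]})
age_column = df['age']

print(type(age_column).__name__)  # Output: Series
```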

You can access the rows and columns of a DataFrame using the indexing and slicing operators, or by using the loc[] indexer (which selects by label) and the iloc[] indexer (which selects by integer position). For example:

import pandas as pd

data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = pd.DataFrame(data)

# Access the first row
print(df[0:1])

# Output:
#    name  age
# 0  John   30

# Access the second column
print(df['age'])

# Output:
# 0    30
# 1    25
# 2    35
# Name: age, dtype: int64

# Access a row and a column by label using the loc[] indexer
print(df.loc[0, 'name'])

# Output: John

# Access a row and a column by position using the iloc[] indexer
print(df.iloc[0, 0])

# Output: John
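
With the default index, row labels and row positions happen to be the same numbers, so loc[] and iloc[] look interchangeable. The sketch below assigns a hypothetical custom index (10, 20, 30) to show where they differ.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = pd.DataFrame(data, index=[10, 20, 30])  # hypothetical row labels

# loc[] uses the row label, iloc[] uses the row position
print(df.loc[10, 'name'])  # Output: John
print(df.iloc[0, 0])       # Output: John

# A loc[] slice includes the end label;
# an iloc[] slice excludes the end position
by_label = df.loc[10:20]
by_position = df.iloc[0:1]

print(len(by_label))     # Output: 2
print(len(by_position))  # Output: 1
```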

Pandas also has a variety of functions and methods for manipulating and performing operations on the data in a DataFrame or Series. For example, you can use the head() and tail() methods to view the first and last few rows of a DataFrame, the describe() method to compute statistical properties of the data, and the groupby() method to group the data by a certain column. You can also use the apply() method to apply a function to the data, and the pivot_table() method to create a pivot table. For example:

import pandas as pd

data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)

# Apply a function to the data
df['age_plus_one'] = df['age'].apply(lambda x: x + 1)

print(df)

# Output:
#     name  age  score  age_plus_one
# 0   John   30     90            31
# 1   Jane   25     80            26
# 2  James   35     95            36
# 3   John   40     70            41
# 4   Jane   45     85            46

# Create a pivot table
pivot = df.pivot_table(index='name', values='score', aggfunc='mean')

print(pivot)

# Output:
#        score
# name
# James   95.0
# Jane    82.5
# John    80.0
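
The head(), tail(), describe(), and groupby() methods mentioned above can be sketched on the same sample data as follows. The variable names are illustrative, and only values that are straightforward to verify by hand are shown as output.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)

# View the first and last two rows
first_two = df.head(2)
last_two = df.tail(2)
print(first_two['name'].tolist())  # Output: ['John', 'Jane']
print(last_two['name'].tolist())   # Output: ['John', 'Jane']

# Summary statistics (count, mean, std, min, quartiles, max) for one column
summary = df['score'].describe()
print(summary['mean'])  # Output: 84.0

# Group the rows by name and average the scores within each group
mean_by_name = df.groupby('name')['score'].mean()
print(mean_by_name['Jane'])  # Output: 82.5
```

For a single index column and a single value column, this groupby() call yields the same averages as the pivot_table() call shown earlier.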

Conclusion

In summary, NumPy and Pandas are powerful libraries for working with data in Python. NumPy is great for working with large, multi-dimensional arrays and matrices of numerical data, while Pandas is useful for working with tabular data and performing complex operations on the data. By learning how to use these libraries effectively, you’ll be able to write more efficient and powerful Python programs for working with data.

Exercises

To review these concepts, work through the following exercises, which are designed to test your understanding and help you apply what you have learned. Each exercise is followed by a sample solution.

Write a function that takes a list of integers as an argument and returns a NumPy array of the integers.

import numpy as np

def to_array(lst):
    return np.array(lst)

print(to_array([1, 2, 3]))  # Output: [1 2 3]
print(to_array([4, 5, 6]))  # Output: [4 5 6]

Write a function that takes a NumPy array as an argument and returns the shape of the array.

import numpy as np

def shape(arr):
    return arr.shape

a = np.array([[1, 2, 3], [4, 5, 6]])
print(shape(a))  # Output: (2, 3)

b = np.array([1, 2, 3])
print(shape(b))  # Output: (3,)

Write a function that takes a dictionary and a list of strings as arguments and returns a Pandas DataFrame with the dictionary as the data and the list of strings as the column names. (For dictionary data, the column names should match the dictionary’s keys.)

import pandas as pd

def to_df(data, columns):
    return pd.DataFrame(data, columns=columns)

data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = to_df(data, ['name', 'age'])

print(df)

# Output:
#     name  age
# 0   John   30
# 1   Jane   25
# 2  James   35
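
One detail of this solution is easy to miss: when the data is a dictionary, the columns argument selects and orders the dictionary’s keys and does not rename them. The sketch below illustrates this behavior; the 'city' and 'first_name' labels are hypothetical and used only for demonstration.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}

# columns selects and orders the existing keys
reordered = pd.DataFrame(data, columns=['age', 'name'])
print(list(reordered.columns))  # Output: ['age', 'name']

# A label that is not a key produces a column of missing values
extra = pd.DataFrame(data, columns=['name', 'city'])
print(extra['city'].isna().all())  # Output: True

# To give a column a different name, rename it after construction
renamed = pd.DataFrame(data).rename(columns={'name': 'first_name'})
print(list(renamed.columns))  # Output: ['first_name', 'age']
```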

Write a function that takes a Pandas DataFrame as an argument and returns the mean of the values in the ‘score’ column.

import pandas as pd

def mean_score(df):
    return df['score'].mean()

data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)
mean = mean_score(df)

print(mean)  # Output: 84.0

Write a function that takes a Pandas DataFrame as an argument and returns a pivot table of the data with the ‘name’ column as the index and the ‘score’ column as the values.

import pandas as pd

def pivot_table(df):
    return df.pivot_table(index='name', values='score', aggfunc='mean')

data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)
pivot = pivot_table(df)

print(pivot)

# Output:
#        score
# name
# James   95.0
# Jane    82.5
# John    80.0