NumPy and Pandas are two powerful libraries for working with data in Python. NumPy is a library for working with large, multi-dimensional arrays and matrices of numerical data, while Pandas is a library for working with tabular data, such as data stored in a spreadsheet or a database table. In this article, we’ll explore the key features and differences of NumPy and Pandas, and how to use them effectively in your Python programs.
NumPy
First, let’s take a look at NumPy. NumPy provides a multi-dimensional array type, the ndarray, along with functions for creating, manipulating, and performing mathematical operations on arrays and matrices of numerical data.
To use NumPy, you’ll need to install it and import it into your Python program. You can install NumPy using pip, the Python package manager. For example:
pip install numpy
Once you’ve installed NumPy, you can import it into your Python program using the import statement. For example:
import numpy as np
NumPy arrays are similar to Python lists, but they have some important differences. A NumPy array stores elements of a single data type in a contiguous block of memory, so numerical operations on it are typically faster and use less memory than the equivalent work on a Python list. Arrays also let you perform mathematical operations on the entire array at once, rather than looping over individual elements. You can create a NumPy array from a Python list by using the np.array() function. For example:
import numpy as np
# Create a NumPy array from a Python list
a = np.array([1, 2, 3])
print(a) # Output: [1 2 3]
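To make the "entire array" point concrete, here is a small sketch comparing the same doubling operation on a list and on an array. The variable names and values are made up for illustration.

```python
import numpy as np

prices = [1.0, 2.0, 3.0]        # plain Python list (illustrative values)
prices_arr = np.array(prices)   # the same values as a NumPy array

# With a list, each element has to be handled explicitly.
doubled_list = [p * 2 for p in prices]

# With an array, the operation is applied to every element at once.
doubled_arr = prices_arr * 2

# For a list, * means repetition, not multiplication.
repeated_list = prices * 2

print(doubled_list)   # Output: [2.0, 4.0, 6.0]
print(doubled_arr)    # Output: [2. 4. 6.]
print(repeated_list)  # Output: [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
```

The last line is a common source of confusion when moving between lists and arrays: the same operator does two different things.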
You can access the elements of a NumPy array using the indexing and slicing operators, just like with a Python list. For example:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
# Access the first element
print(a[0]) # Output: 1
# Access the last element
print(a[-1]) # Output: 5
# Access a slice of elements
print(a[1:3]) # Output: [2 3]
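One difference from lists is worth knowing here: slicing a list creates a new list, while basic slicing of a NumPy array returns a view that shares memory with the original. The sketch below, with illustrative variable names, shows the effect and how copy() avoids it.

```python
import numpy as np

numbers_list = [1, 2, 3, 4, 5]
numbers_arr = np.array(numbers_list)

list_slice = numbers_list[1:3]  # a new, independent list
arr_slice = numbers_arr[1:3]    # a view onto numbers_arr

list_slice[0] = 99
arr_slice[0] = 99

print(numbers_list)  # Output: [1, 2, 3, 4, 5]   (unchanged)
print(numbers_arr)   # Output: [ 1 99  3  4  5]  (changed through the view)

# Use copy() when you need a slice that is independent of the original.
safe_slice = numbers_arr[1:3].copy()
safe_slice[0] = -1
print(numbers_arr)   # Output: [ 1 99  3  4  5]  (unchanged this time)
```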
NumPy arrays also have some useful attributes and methods for manipulating and performing mathematical operations on the data. For example, you can use the shape attribute (it is not a method, so it takes no parentheses) to get the dimensions of the array, the reshape() method to return the same data in a new shape, and the transpose() method to transpose the array. You can also use methods like mean(), sum(), and std() to compute statistical properties of the data in the array. For example:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
# Get the dimensions of the array
print(a.shape) # Output: (2, 3)
# Reshape the array
b = a.reshape(3, 2)
print(b) # Output: [[1 2] [3 4] [5 6]]
# Transpose the array
c = b.transpose()
print(c) # Output: [[1 3 5] [2 4 6]]
# Compute the mean of the array
mean = a.mean()
print(mean) # Output: 3.5
# Compute the sum of the array
total = a.sum()  # named "total" to avoid shadowing the built-in sum()
print(total) # Output: 21
# Compute the standard deviation of the array
std = a.std()
print(std) # Output: 1.707825127659933
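The statistical methods above reduce the whole array to a single number. They also accept an axis argument when you want one result per row or per column instead. A minimal sketch, reusing the same 2x3 array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# axis=0 collapses the rows, giving one result per column.
col_sums = a.sum(axis=0)
# axis=1 collapses the columns, giving one result per row.
row_sums = a.sum(axis=1)
row_means = a.mean(axis=1)

print(col_sums)   # Output: [5 7 9]
print(row_sums)   # Output: [ 6 15]
print(row_means)  # Output: [2. 5.]
```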
Pandas
Now let’s take a look at Pandas. Pandas is a library that provides support for working with tabular data, such as data stored in a spreadsheet or a database table. Pandas is particularly useful for working with data that has multiple columns and rows, and for performing complex operations on the data.
To use Pandas, you’ll need to install it and import it into your Python program. You can install Pandas using pip, the Python package manager. For example:
pip install pandas
Once you’ve installed Pandas, you can import it into your Python program using the import statement. For example:
import pandas as pd
Pandas provides two main data structures for working with tabular data: the DataFrame and the Series. A DataFrame is a 2-dimensional table of data with labeled rows and columns, similar to a spreadsheet. A Series is a 1-dimensional labeled array; each column of a DataFrame is a Series. You can create a DataFrame from a Python dictionary by using the pd.DataFrame() function, or a Series from a Python list by using the pd.Series() function. For example:
import pandas as pd
# Create a DataFrame from a dictionary
data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = pd.DataFrame(data)
print(df)
# Output:
# name age
# 0 John 30
# 1 Jane 25
# 2 James 35
# Create a Series from a list
s = pd.Series([1, 2, 3])
print(s)
# Output:
# 0 1
# 1 2
# 2 3
# dtype: int64
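The relationship between the two structures can be sketched as follows: selecting one column of a DataFrame gives you a Series, and a Series can carry its own labels instead of the default 0, 1, 2. The scores Series and its values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]})

# A single column of a DataFrame is a Series named after the column.
ages = df['age']
print(type(ages).__name__)  # Output: Series
print(ages.name)            # Output: age

# A Series can use custom labels as its index (illustrative data).
scores = pd.Series([90, 80, 95], index=['John', 'Jane', 'James'])
print(scores['Jane'])       # Output: 80
```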
You can access the rows and columns of a DataFrame using the indexing and slicing operators, or by using the loc[] and iloc[] indexers. loc[] selects by label, while iloc[] selects by integer position. For example:
import pandas as pd
data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = pd.DataFrame(data)
# Access the first row
print(df[0:1])
# Output:
# name age
# 0 John 30
# Access the second column
print(df['age'])
# Output:
# 0 30
# 1 25
# 2 35
# Name: age, dtype: int64
# Access a single value by row label and column name with loc[]
print(df.loc[0, 'name'])
# Output: John
# Access a single value by row and column position with iloc[]
print(df.iloc[0, 0])
# Output: John
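In the example above, loc[] and iloc[] look interchangeable because the default row labels happen to be 0, 1, 2. The sketch below uses a hypothetical index of 10, 20, 30 (not from the original data) to show where they differ.

```python
import pandas as pd

data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
# The row labels 10, 20, 30 are an assumption made for this illustration.
df = pd.DataFrame(data, index=[10, 20, 30])

by_label = df.loc[10, 'name']   # row labeled 10, column 'name'
by_position = df.iloc[0, 0]     # first row, first column
print(by_label)     # Output: John
print(by_position)  # Output: John
# df.loc[0, 'name'] would raise a KeyError, because no row is labeled 0.

# loc[] slices include the end label; iloc[] slices exclude the end position.
label_slice = df.loc[10:20, 'age']   # rows labeled 10 and 20
position_slice = df.iloc[0:1, 1]     # first row only
print(label_slice.tolist())     # Output: [30, 25]
print(position_slice.tolist())  # Output: [30]
```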
Pandas also has a variety of functions and methods for manipulating and performing operations on the data in a DataFrame or Series. For example, you can use the head() and tail() methods to view the first and last few rows of a DataFrame, the describe() method to compute statistical properties of the data, and the groupby() method to group the data by a certain column. You can also use the apply() method to apply a function to the data, and the pivot_table() method to create a pivot table. For example:
import pandas as pd
data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)
# Apply a function to the data
df['age_plus_one'] = df['age'].apply(lambda x: x + 1)
print(df)
# Output:
# name age score age_plus_one
# 0 John 30 90 31
# 1 Jane 25 80 26
# 2 James 35 95 36
# 3 John 40 70 41
# 4 Jane 45 85 46
# Create a pivot table
pivot = df.pivot_table(index='name', values='score', aggfunc='mean')
print(pivot)
# Output:
# score
# name
# James 95.0
# Jane 82.5
# John 80.0
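The head(), tail(), describe(), and groupby() methods mentioned earlier can be sketched on the same data. This is a minimal illustration rather than a full tour of their options.

```python
import pandas as pd

data = {
    'name': ['John', 'Jane', 'James', 'John', 'Jane'],
    'age': [30, 25, 35, 40, 45],
    'score': [90, 80, 95, 70, 85],
}
df = pd.DataFrame(data)

# View the first and last two rows.
first_two = df.head(2)
last_two = df.tail(2)
print(first_two)
print(last_two)

# Summary statistics (count, mean, std, min, quartiles, max) for one column.
summary = df['score'].describe()
print(summary['mean'])  # Output: 84.0

# Group the rows by name and average the scores within each group.
mean_by_name = df.groupby('name')['score'].mean()
print(mean_by_name)
```

For this data, the groupby() result contains the same per-name averages as a pivot table indexed by name with aggfunc='mean'; the difference is that groupby() on a single column returns a Series, while pivot_table() returns a DataFrame.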
Conclusion
In summary, NumPy and Pandas are powerful libraries for working with data in Python. NumPy is great for working with large, multi-dimensional arrays and matrices of numerical data, while Pandas is useful for working with tabular data and performing complex operations on the data. By learning how to use these libraries effectively, you’ll be able to write more efficient and powerful Python programs for working with data.
Exercises
To review these concepts, work through the following exercises, which test your understanding and let you apply what you have learned.
Write a function that takes a list of integers as an argument and returns a NumPy array of the integers.
import numpy as np
def to_array(lst):
    return np.array(lst)
print(to_array([1, 2, 3])) # Output: [1 2 3]
print(to_array([4, 5, 6])) # Output: [4 5 6]
Write a function that takes a NumPy array as an argument and returns the shape of the array.
import numpy as np
def shape(arr):
    return arr.shape
a = np.array([[1, 2, 3], [4, 5, 6]])
print(shape(a)) # Output: (2, 3)
b = np.array([1, 2, 3])
print(shape(b)) # Output: (3,)
Write a function that takes a dictionary and a list of strings as arguments and returns a Pandas DataFrame with the dictionary as the data and the list of strings as the column names.
import pandas as pd
def to_df(data, columns):
    return pd.DataFrame(data, columns=columns)
data = {'name': ['John', 'Jane', 'James'], 'age': [30, 25, 35]}
df = to_df(data, ['name', 'age'])
print(df)
# Output:
# name age
# 0 John 30
# 1 Jane 25
# 2 James 35
Write a function that takes a Pandas DataFrame as an argument and returns the mean of the values in the ‘score’ column.
import pandas as pd
def mean_score(df):
    return df['score'].mean()
data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)
mean = mean_score(df)
print(mean) # Output: 84.0
Write a function that takes a Pandas DataFrame as an argument and returns a pivot table of the data with the ‘name’ column as the index and the ‘score’ column as the values.
import pandas as pd
def pivot_table(df):
    return df.pivot_table(index='name', values='score', aggfunc='mean')
data = {'name': ['John', 'Jane', 'James', 'John', 'Jane'], 'age': [30, 25, 35, 40, 45], 'score': [90, 80, 95, 70, 85]}
df = pd.DataFrame(data)
pivot = pivot_table(df)
print(pivot)
# Output:
# score
# name
# James 95.0
# Jane 82.5
# John 80.0