Another library in the Python ecosystem is pandas (PANel DAta). This library can help us to execute five common steps in data analysis:
- Load data.
- Data preparation.
- Data manipulation.
- Data modelling.
- Data analysis.
The main pandas structure is the DataFrame: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes. It is composed of three elements: the data, the index (row labels) and the columns (column labels). In addition, names can be given to the index and to the columns.
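As a minimal sketch of those three elements, the snippet below builds a small DataFrame with explicit row and column labels and then reads each element back. The values and the label names (row1, col1, "rows", "cols") are invented for illustration.

```python
import pandas as pd

# Illustrative values and labels (not taken from any real data set).
df = pd.DataFrame(
    data=[[11, 22], [33, 44]],
    index=["row1", "row2"],
    columns=["col1", "col2"],
)

# Naming the two axes is optional.
df.index.name = "rows"
df.columns.name = "cols"

print(df.to_numpy())  # the data, as a NumPy array
print(df.index)       # the row labels
print(df.columns)     # the column labels
```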
Main library characteristics
- The DataFrame object is fast and efficient.
- Tools to load data in memory from different formats.
- Data alignment and missing data management.
- Reshaping and pivoting data sets.
- Label-based slicing, indexing and subsetting of large data sets.
- Columns can be removed or inserted.
- Data grouping for aggregation and transformation.
- High-performance merging and joining of data sets.
- Time series functionality.
- It has three main structures:
- Series: 1D structures.
- DataFrame: 2D structures.
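- Panel: 3D structures (deprecated in pandas 0.20 and removed in 0.25; a DataFrame with a MultiIndex is the usual replacement).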
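A few of the characteristics above (missing data, grouping, merging, and inserting or removing columns) can be sketched as follows. The two small tables and their column names are hypothetical, made up only to have something to work with.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame({
    "country": ["UK", "UK", "Spain", "France"],
    "amount": [10.0, 20.0, 5.0, None],  # None is stored as NaN (missing)
})
capitals = pd.DataFrame({
    "country": ["UK", "Spain", "France"],
    "capital": ["London", "Madrid", "Paris"],
})

# Grouping for aggregation: total amount per country (NaN is skipped).
totals = sales.groupby("country")["amount"].sum()

# Merge: attach the capital to every sales row.
merged = sales.merge(capitals, on="country", how="left")

# Columns can be inserted and removed.
merged["amount_doubled"] = merged["amount"] * 2
merged = merged.drop(columns=["amount_doubled"])

print(totals)
print(merged)
```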
Installing pandas
The pandas library is not included in the default Python installation, so it needs to be installed:
pip install -U pandas
Useful pandas methods
Creating a Series
import pandas as pd
series = pd.Series({"UK": "London",
                    "Germany": "Berlin",
                    "France": "Paris",
                    "Spain": "Madrid"})
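Once the Series exists, its values can be read by label or by position. This is only a short sketch of the most common lookups, reusing the same dictionary as above:

```python
import pandas as pd

series = pd.Series({"UK": "London",
                    "Germany": "Berlin",
                    "France": "Paris",
                    "Spain": "Madrid"})

print(series["Spain"])        # look up a value by its label
print(series.iloc[0])         # look up a value by its position
print(series.index.tolist())  # the labels (the dictionary keys)
```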
Creating a DataFrame
import numpy as np
data = np.array([['', 'Col1', 'Col2'], ['Row1', 11, 22], ['Row2', 33, 44]])
You can find the code example here.
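One way to turn a labelled array like this into a DataFrame is to slice it into its three parts: the first row as column labels, the first column as row labels, and the rest as data. The sketch below assumes that layout. The astype(int) step is my addition: NumPy stores every cell of a mixed array as a string, so the numbers need converting back.

```python
import numpy as np
import pandas as pd

data = np.array([['', 'Col1', 'Col2'], ['Row1', 11, 22], ['Row2', 33, 44]])

# First row holds the column labels, first column holds the row labels.
df = pd.DataFrame(data=data[1:, 1:], index=data[1:, 0], columns=data[0, 1:])

# Every cell was stored as a string, so convert the values back to integers.
df = df.astype(int)
print(df)
```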
Without the boilerplate code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
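Because no labels were passed, pandas numbers the rows and the columns from 0, which is why the expressions in the next section use integers such as df[0]. A small sketch of this, with column names "a", "b" and "c" that are purely hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))

# No labels were given, so rows and columns are numbered 0, 1, 2.
print(list(df.index))
print(list(df.columns))

# Optionally give the columns readable names (hypothetical names).
named = df.rename(columns={0: "a", 1: "b", 2: "c"})
print(named)
```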
Exploring a DataFrame
- df.shape – DataFrame shape, as a (rows, columns) tuple.
- len(df.index) – DataFrame height (number of rows).
- df.describe() – DataFrame numeric statistics (count, mean, std, min, 25%, 50%, 75%, max).
- df.mean() – Return the mean of the values for the requested axis.
- df.corr() – Correlation of columns.
- df.count() – Count of non-null values per column.
- df.max() – Maximum value per column.
- df.min() – Minimum value per column.
- df.median() – Median value per column.
- df.std() – Standard deviation per column.
- df[0] – Select a DataFrame column (returned as a Series).
- df[[1, 2]] – Select two DataFrame columns (returned as a new DataFrame).
- df.iloc[0, 2] – Select a single value by position (row 0, column 2).
- df.loc[0] – Select a row by its index label.
- df.iloc[0, :] – Select a row by its integer position.
- pd.read_<file_type>() – Read from a file (for example, pd.read_csv('train.csv')).
- df.to_<file_type>() – Write to a file (for example, df.to_csv('new_train.csv')).
- df.isnull() – Check whether there are null values in the data set (returns a Boolean mask).
- df.isnull().sum() – Return the number of null values per column in the data set.
- df.dropna() or df.dropna(axis = 1) – Remove rows or columns with missing data.
- df.fillna(x) – Replace missing values with x (df.fillna(df.mean())).
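The calls in the list above can be tried together on the 3x3 DataFrame from the previous section. This is only a sketch: the missing cell is introduced by hand, and an in-memory buffer stands in for a real file such as train.csv so that the snippet runs without anything on disk.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), dtype=float)

# Size and summary statistics.
print(df.shape)
print(df.describe())

# Selection.
column = df[0]         # one column, returned as a Series
columns = df[[1, 2]]   # two columns, returned as a DataFrame
value = df.iloc[0, 2]  # row 0, column 2, by position
row = df.loc[0]        # one row by index label, returned as a Series

# Missing data: blank out one cell, then count, fill and drop.
df.iloc[1, 1] = np.nan
missing_per_column = df.isnull().sum()
filled = df.fillna(df.mean())  # replace NaN with the column mean
dropped = df.dropna()          # drop the rows that contain NaN

# Round trip through CSV, using a buffer instead of a file on disk.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Note that the column labels come back from the CSV as the strings "0", "1" and "2", since a CSV header is plain text.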
And that is all. This has been just a quick, very quick, review of the pandas library. I recommend you play around with it a bit more, and we will use it more in the future.