Creating a Comprehensive Education Index from Indonesia's Data
Chapter 1: Introduction to Composite Indexes
In order to develop thorough policies, decision-makers must identify key areas for enhancement. A composite index allows for the aggregation of various indicators, making it possible to rank different sectors and assess performance.
The Concept of Factor Analysis
Factor analysis (FA) is a linear statistical technique that simplifies a set of variables by reducing them to a smaller number of underlying factors. Like Principal Component Analysis (PCA), it exploits the interrelationships among the variables; by interpreting the resulting structure, analysts can group related variables into factors that each explain a share of the common variance. The method is frequently applied in areas such as market research and advertising.
Hands-on Tutorial
To implement factor analysis and build a composite index (in this case, an education index), several Python modules need to be installed via the Anaconda Prompt; a one-line install command is shown after the list:
- pandas: for data manipulation (slicing, aggregation, etc.)
- numpy: for linear algebra computations
- factor_analyzer: for executing factor analysis, including adequacy tests and loading factor calculations
- plotnine: for data visualization, similar to ggplot2 in R
- scikit-learn: for data standardization
- geopandas: for geospatial data visualization
- folium: for creating choropleth maps
- branca: for generating HTML and JavaScript pages using Python
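For instance, all of them can be installed in one line from the Anaconda Prompt (shown here with pip; conda install works for most of these as well):
pip install pandas numpy factor_analyzer plotnine scikit-learn geopandas folium branca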
Once the modules are installed, you can start a Jupyter Notebook or your preferred IDE and import the necessary libraries to proceed with the tutorial on creating a composite index.
# Importing required modules
import pandas as pd
import numpy as np
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo
from sklearn.preprocessing import MinMaxScaler
from plotnine import *
import geopandas as gpd
import folium
import branca
import branca.colormap as cm
from folium.features import GeoJson, GeoJsonTooltip
import json
To make the outputs easier to scan, conditional formatting will be applied with the pandas styling API, highlighting the key metrics produced by the factor analysis: factor loadings, communalities, and eigenvalues.
# Functions for highlighting values in styled DataFrames
def highlightLoadings(x):
    # Highlight loadings with absolute value above 0.5
    return ['background-color: yellow' if abs(v) > 0.5 else '' for v in x]

def highlightCommunalities(x):
    # Highlight communalities above 0.5
    return ['background-color: yellow' if v > 0.5 else '' for v in x]

def highlightEigenvalue(x):
    # Highlight eigenvalues above 1 (Kaiser criterion)
    return ['background-color: yellow' if v > 1 else '' for v in x]
The dataset used in this tutorial comes from Indonesia's Central Statistics Agency (Badan Pusat Statistik, BPS) and comprises 34 rows, one per province, and 30 columns: 25 indicators related to the education index and 5 identifier columns used in later analysis.
The dataset can be read into a pandas DataFrame as follows:
# Reading the data
df = pd.read_excel('BPS Indonesia Education Index - Processed Eng.xlsx', engine='openpyxl', sheet_name='Data')
df.dropna(inplace=True, axis=1)
print('Dimension of data: {} rows and {} columns'.format(len(df), len(df.columns)))
df.head()
The education indicators are derived from 7 primary categories, reflecting various education levels (elementary, junior high, senior high, vocational high, and university). The indicators include aspects like classroom conditions, teacher eligibility, and enrollment rates.
To proceed, the non-indicator columns will be eliminated to prepare for standardization using a min-max scaler.
# Filtering numerical data
df_fix = df[[col for col in df.columns if col not in ['Province', 'Code', 'Region', 'Population', 'HDI']]]
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(data=scaler.fit_transform(df_fix), columns=df_fix.columns)
df_scaled.head()
Before conducting factor analysis, it’s essential to verify whether the dataset is suitable for this method through the Bartlett and Kaiser-Meyer-Olkin (KMO) tests.
# Bartlett's test
chiSquareValue, pValue = calculate_bartlett_sphericity(df_scaled)
print('Chi-square value : {}'.format(round(chiSquareValue, ndigits=3)))
print('p-value : {}'.format(round(pValue, ndigits=3)))
# KMO test
KMO, KMO_model = calculate_kmo(df_scaled)
print('KMO value : {}'.format(round(KMO_model, ndigits=3)))
A Bartlett's sphericity p-value below 0.05 lets us reject the null hypothesis that the correlation matrix is an identity matrix; in other words, the variables are sufficiently intercorrelated for factor analysis. The KMO statistic measures sampling adequacy: values of 0.5 or higher indicate the data are suitable, while anything below 0.5 suggests factor analysis may not be appropriate.
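As a convenience, both checks can be combined into a single guard before moving on; a minimal sketch using the values computed above:
# Proceed only if both adequacy tests pass (conventional cutoffs: p < 0.05, KMO >= 0.5)
if pValue < 0.05 and KMO_model >= 0.5:
    print('Data are adequate for factor analysis')
else:
    print('Adequacy tests failed; interpret results with caution')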
Assuming both checks pass, we now create a factor analysis object and fit it, initially without rotation and with one factor per variable, so that communalities and eigenvalues can be inspected.
# Performing factor analysis
fa = FactorAnalyzer(n_factors=25, rotation=None)
fa.fit(df_scaled)
# Communalities
df_communalities = pd.DataFrame(data={'Column': df_scaled.columns, 'Communality': fa.get_communalities()})
df_communalities.style.apply(highlightCommunalities, subset=['Communality'])
The communalities indicate which variables are well represented for further analysis; with the 0.5 cutoff used in the highlighting function, highlighted variables are the ones whose variance the factors capture adequately. Once the suitable variables are identified, we will determine the number of factors and interpret them accordingly.
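If you wanted to drop poorly represented variables at this point, you could keep only those whose communality clears the 0.5 cutoff. This is a sketch of that optional step, not something the tutorial itself performs (all 25 indicators are kept below):
# Optional: retain only variables with communality above 0.5
keep = df_communalities.loc[df_communalities['Communality'] > 0.5, 'Column']
df_retained = df_scaled[list(keep)]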
# Check Eigenvalues
eigenValue, value = fa.get_eigenvalues()
df_eigen = pd.DataFrame({'Factor': range(1, len(eigenValue) + 1), 'Eigen value': eigenValue})
df_eigen.style.apply(highlightEigenvalue, subset=['Eigen value'])
Based on the Kaiser criterion, which retains factors with eigenvalues greater than 1, we keep 6 factors to group the 25 variables into meaningful clusters.
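To double-check the cutoff visually, a scree plot of the eigenvalues works well; a minimal sketch with plotnine (imported earlier), using the df_eigen frame built above:
# Scree plot with the Kaiser cutoff marked at 1
df_scree = df_eigen.rename(columns={'Eigen value': 'Eigenvalue'})
(
    ggplot(df_scree, aes(x='Factor', y='Eigenvalue'))
    + geom_line()
    + geom_point()
    + geom_hline(yintercept=1, linetype='dashed')
    + labs(title='Scree Plot')
)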
Next, we will visualize the explained variance and factor loadings.
# Visualizing explained variance
n_factor = 6  # number of factors retained by the Kaiser criterion
facs = ['Factor {}'.format(i + 1) for i in range(n_factor)]
# Refit with the retained number of factors before inspecting variance
fa = FactorAnalyzer(n_factors=n_factor, rotation=None)
fa.fit(df_scaled)
idx = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']
df_variance = pd.DataFrame(data=fa.get_factor_variance(), index=idx, columns=facs)
# Each factor's share of the total explained variance (used later as weights)
ratioVariance = fa.get_factor_variance()[1] / fa.get_factor_variance()[1].sum()
df_ratio_var = pd.DataFrame(data=ratioVariance.reshape((1, n_factor)), index=['Ratio Variance'], columns=facs)
pd.concat([df_variance, df_ratio_var])
This analysis will summarize the proportion of variance explained by each factor, which will inform the weighting process for constructing the composite index.
# Conducting factor analysis with rotation
fa = FactorAnalyzer(n_factors=n_factor, rotation='varimax')
fa.fit(df_scaled)
# Loading factors
pd.DataFrame(data=fa.loadings_, index=df_scaled.columns, columns=facs).style.apply(highlightLoadings)
The rotation will help clarify which indicators dominate each factor, allowing for a more accurate interpretation.
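One quick way to read the rotated solution is to list the indicator with the largest absolute loading on each factor; this helper is not part of the original tutorial, just a convenience:
# Indicator with the strongest absolute loading per rotated factor
df_loadings = pd.DataFrame(data=fa.loadings_, index=df_scaled.columns, columns=facs)
print(df_loadings.abs().idxmax())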
Finally, we will generate the composite index by aggregating the weighted factor scores and assigning ranks to each province based on their education index.
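In formula form, the index for province p is a weighted sum of its scaled factor scores s_pj, with weights w_j equal to each factor's share of the explained variance (this matches the aggregation loop below):
Composite Index_p = Σ_j (w_j × s_pj), where w_j = V_j / (V_1 + … + V_6)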
# Factor scores per province, rescaled to [0, 1] (reconstructs the df_factors_scaled used below)
factor_scores = fa.transform(df_scaled)
df_factors_scaled = pd.DataFrame(data=MinMaxScaler().fit_transform(factor_scores), columns=facs)
# Aggregating factor scores, weighted by each factor's share of explained variance
dict_index = {}
for i in range(n_factor):
    key = df_factors_scaled.columns[i]
    value = df_factors_scaled.iloc[:, i].values * df_ratio_var.iloc[:, i].values
    dict_index.update({key: value})
df_index = pd.DataFrame(dict_index, index=pd.MultiIndex.from_frame(df[['Province', 'Region']]))
df_index['Composite Index'] = df_index.sum(axis=1).values
df_index['Rank'] = df_index['Composite Index'].rank(ascending=False)
df_index = df_index.sort_values(by='Rank').reset_index()
This education index can then be visualized using various charts to provide insights into the educational landscape across provinces.
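Since geopandas, folium, and branca were imported at the start, a choropleth map is a natural way to show the index. The sketch below assumes a provinces boundary file named 'indonesia_provinces.geojson' with a 'Province' property matching the names in df_index; both are hypothetical, so adjust them to your own boundary data:
# Choropleth of the composite index (file name and 'Province' property are assumptions)
gdf = gpd.read_file('indonesia_provinces.geojson')
gdf = gdf.merge(df_index[['Province', 'Composite Index']], on='Province')
colormap = cm.linear.YlGnBu_09.scale(gdf['Composite Index'].min(), gdf['Composite Index'].max())
m = folium.Map(location=[-2.5, 118.0], zoom_start=5, tiles='cartodbpositron')
GeoJson(
    gdf,
    style_function=lambda feature: {
        'fillColor': colormap(feature['properties']['Composite Index']),
        'color': 'black',
        'weight': 0.5,
        'fillOpacity': 0.7,
    },
    tooltip=GeoJsonTooltip(fields=['Province', 'Composite Index']),
).add_to(m)
colormap.add_to(m)
m.save('education_index_map.html')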
In conclusion, constructing a composite indicator involves several steps: variable selection, multivariate analysis, data normalization, weighting, and aggregation. Each step utilizes different methods, and the choice of methodology can significantly affect the final output. This composite index serves as a valuable tool for policymakers aiming to enhance educational outcomes.
References
[1] [BPS] Badan Pusat Statistik. Potret Pendidikan Indonesia: Statistik Pendidikan 2020 (2020), Jakarta (ID): Badan Pusat Statistik.
[2] [OECD] Organisation for Economic Co-operation and Development. Handbook on Constructing Composite Indicators: Methodology and User Guide (2008), OECD.