
Creating a Comprehensive Education Index from Indonesia's Data

Chapter 1: Introduction to Composite Indexes

In order to develop thorough policies, decision-makers must identify key areas for enhancement. A composite index allows for the aggregation of various indicators, making it possible to rank different sectors and assess performance.

The Concept of Factor Analysis

Factor analysis (FA) is a linear statistical method that reduces a set of observed variables to a smaller number of underlying factors. Like Principal Component Analysis (PCA), it exploits the correlations among the variables. Analysts interpret the result by grouping related variables into factors, each of which explains a share of the total variance. The method is frequently applied in areas such as market research and advertising.

Hands-on Tutorial

To implement factor analysis and build a composite index (here, an education index), several Python modules need to be installed, for example via the Anaconda Prompt:

  • pandas: for data manipulation (slicing, aggregation, etc.)
  • numpy: for linear algebra computations
  • factor_analyzer: for executing factor analysis, including adequacy tests and loading factor calculations
  • plotnine: for data visualization, similar to ggplot2 in R
  • scikit-learn: for data standardization
  • geopandas: for geospatial data visualization
  • folium: for creating choropleth maps
  • branca: for generating HTML and JavaScript pages using Python
  • openpyxl: for reading the Excel workbook (used as the pandas read_excel engine below)

Once the modules are installed, you can start a Jupyter Notebook or your preferred IDE and import the necessary libraries to proceed with the tutorial on creating a composite index.

# Importing required modules

import pandas as pd

import numpy as np

from sklearn.datasets import make_regression

from factor_analyzer import FactorAnalyzer

from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

from sklearn.preprocessing import MinMaxScaler

from plotnine import *

import geopandas as gpd

import folium

import branca

import branca.colormap as cm

from folium.features import GeoJson, GeoJsonTooltip

import json

To facilitate understanding and gain insights from the data, conditional formatting will be applied using the pandas styling method. This will highlight key metrics—loading factors, communalities, and eigenvalues—resulting from the factor analysis.

# Functions for highlighting values

def highlightLoadings(x):

    # Flag loadings whose absolute value exceeds 0.5

    return ['background-color: yellow' if abs(v) > 0.5 else '' for v in x]

def highlightCommunalities(x):

    # Flag communalities above 0.5

    return ['background-color: yellow' if v > 0.5 else '' for v in x]

def highlightEigenvalue(x):

    # Flag eigenvalues above 1 (Kaiser criterion)

    return ['background-color: yellow' if v > 1 else '' for v in x]
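As a quick check of what these helpers return, the sketch below calls the loadings highlighter on a plain list. The sample values are made up for illustration, and the function is restated so the snippet runs on its own.

```python
# Restated from the tutorial so this snippet is self-contained.
def highlightLoadings(x):
    return ['background-color: yellow' if abs(v) > 0.5 else '' for v in x]

# Made-up loadings for a single factor (illustrative values only).
sample_loadings = [0.72, -0.10, -0.65, 0.30]

# One CSS string per cell: highlighted when |loading| > 0.5, empty otherwise.
styles = highlightLoadings(sample_loadings)
print(styles)
```

Because the function returns one CSS string per value, pandas can apply it column by column through `Styler.apply`, which is how it is used later in the tutorial.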

The dataset utilized in this tutorial originates from Indonesia's Central Statistics Agency and comprises 34 rows—each representing a province—and 30 columns, with 25 indicators concerning the education index and 5 columns serving as identifiers for further analysis.

To load the dataset, read the Excel workbook with pandas and drop any columns that contain missing values:

# Reading the data

df = pd.read_excel('BPS Indonesia Education Index - Processed Eng.xlsx', engine='openpyxl', sheet_name='Data')

df.dropna(inplace=True, axis=1)

print('Dimension of data: {} rows and {} columns'.format(len(df), len(df.columns)))

df.head()

The education indicators are derived from 7 primary categories, reflecting various education levels (elementary, junior high, senior high, vocational high, and university). The indicators include aspects like classroom conditions, teacher eligibility, and enrollment rates.

To proceed, the non-indicator columns will be eliminated to prepare for standardization using a min-max scaler.

# Filtering numerical data

df_fix = df[[col for col in df.columns if col not in ['Province', 'Code', 'Region', 'Population', 'HDI']]]

scaler = MinMaxScaler()

df_scaled = pd.DataFrame(data=scaler.fit_transform(df_fix), columns=df_fix.columns)

df_scaled.head()
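For intuition, min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). The following is a minimal pure-Python sketch of that formula; the helper name `min_max_scale` and the sample values are hypothetical, and the handling of a constant column is a choice made here for the sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; this sketch maps it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up indicator values for three provinces (illustrative only).
enrollment = [60.0, 75.0, 90.0]
scaled = min_max_scale(enrollment)
print(scaled)
```

After scaling, indicators measured in different units become comparable, which matters because the index later sums weighted scores.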

Before conducting factor analysis, it’s essential to verify whether the dataset is suitable for this method through the Bartlett and Kaiser-Meyer-Olkin (KMO) tests.

# Bartlett's test

chiSquareValue, pValue = calculate_bartlett_sphericity(df_scaled)

print('Chi-square value : {}'.format(round(chiSquareValue, ndigits=3)))

print('p-value : {}'.format(round(pValue, ndigits=3)))

# KMO test

# calculate_kmo returns the per-variable KMO values and the overall KMO value

kmo_per_variable, kmo_total = calculate_kmo(df_scaled)

print('KMO value : {}'.format(round(kmo_total, ndigits=3)))

The Bartlett test yields a p-value below 0.05, so we reject the null hypothesis that the correlation matrix is an identity matrix; in other words, the indicators are correlated with one another, which factor analysis requires. The KMO value must be checked as well: a value below 0.5 suggests that the data may not be suitable for factor analysis.
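To see what the Bartlett statistic measures, the sketch below computes it by hand in its textbook form, chi-square = -(n - 1 - (2p + 5) / 6) * ln|R|, with p(p - 1) / 2 degrees of freedom, where R is the correlation matrix, n the number of rows, and p the number of variables. All helper names and the toy data are hypothetical, and the p-value is left out because it needs a chi-square distribution function that the standard library does not provide.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def bartlett_statistic(columns):
    """Return (chi-square statistic, degrees of freedom) for a list of columns."""
    p = len(columns)
    n = len(columns[0])
    corr = [[pearson(a, b) for b in columns] for a in columns]
    chi_square = -(n - 1 - (2 * p + 5) / 6) * math.log(determinant(corr))
    dof = p * (p - 1) // 2
    return chi_square, dof

# Toy data: 3 variables observed on 5 rows (made up for illustration).
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 5.0],
    [5.0, 3.0, 4.0, 1.0, 2.0],
]
chi_square, dof = bartlett_statistic(data)
print(round(chi_square, 3), dof)
```

The more strongly the variables are correlated, the closer the determinant of R gets to zero and the larger the statistic becomes, which is why a small p-value supports using factor analysis.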

In this tutorial, we will now create a factor analysis object and perform the analysis.

# Performing factor analysis

fa = FactorAnalyzer(n_factors=25, rotation=None)

fa.fit(df_scaled)

# Communalities

df_communalities = pd.DataFrame(data={'Column': df_scaled.columns, 'Communality': fa.get_communalities()})

df_communalities.style.apply(highlightCommunalities, subset=['Communality'])

The communalities will indicate which variables are appropriate for further analysis. Once the suitable variables are identified, we will determine the number of factors and interpret them accordingly.
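A communality is the sum of a variable's squared loadings across all factors, that is, the share of its variance that the factors explain. The sketch below works through that arithmetic on made-up loadings; the helper name `communalities` is hypothetical and is not part of factor_analyzer.

```python
def communalities(loadings):
    """Sum of squared loadings across factors, one value per variable (row)."""
    return [sum(value ** 2 for value in row) for row in loadings]

# Made-up loadings: 3 variables (rows) on 2 factors (columns).
loadings = [
    [0.8, 0.1],
    [0.3, 0.2],
    [0.6, 0.6],
]
h2 = communalities(loadings)
print([round(v, 2) for v in h2])
```

In this toy case the second variable has a communality well below 0.5, so it would not be highlighted by the 0.5 threshold used in this tutorial.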

# Check Eigenvalues

eigenValue, value = fa.get_eigenvalues()

df_eigen = pd.DataFrame({'Factor': range(1, len(eigenValue) + 1), 'Eigen value': eigenValue})

df_eigen.style.apply(highlightEigenvalue, subset=['Eigen value'])

Based on the Kaiser criterion, which retains factors whose eigenvalue exceeds 1, we keep 6 factors that group the 25 variables into meaningful categories.
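The Kaiser rule itself is a simple count. With standardized variables, each variable contributes a variance of 1, so a factor with an eigenvalue above 1 explains more than any single variable does. A minimal sketch, using a hypothetical helper name and made-up eigenvalues for six variables:

```python
def kaiser_n_factors(eigenvalues):
    """Count the eigenvalues strictly greater than 1 (Kaiser criterion)."""
    return sum(1 for ev in eigenvalues if ev > 1)

# Made-up eigenvalues of a 6-variable correlation matrix; they sum to 6,
# as eigenvalues of a correlation matrix always sum to the variable count.
eigenvalues = [2.9, 1.5, 1.1, 0.3, 0.15, 0.05]
n_retained = kaiser_n_factors(eigenvalues)
print(n_retained)
```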

Next, we will visualize the explained variance and factor loadings.

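# Visualizing explained variance

# Assumption: the original snippet never defines n_factor or facs; 6 follows the Kaiser result above and the labels are illustrative

n_factor = 6

facs = ['Factor ' + str(i + 1) for i in range(n_factor)]

fa = FactorAnalyzer(n_factors=n_factor, rotation=None)

fa.fit(df_scaled)

idx = ['SS Loadings', 'Proportion Variance', 'Cumulative Variance']

df_variance = pd.DataFrame(data=fa.get_factor_variance(), index=idx, columns=facs)

ratioVariance = fa.get_factor_variance()[1] / fa.get_factor_variance()[1].sum()

df_ratio_var = pd.DataFrame(data=ratioVariance.reshape((1, n_factor)), index=['Ratio Variance'], columns=facs)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement

pd.concat([df_variance, df_ratio_var])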

This analysis will summarize the proportion of variance explained by each factor, which will inform the weighting process for constructing the composite index.
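The weighting step amounts to rescaling the proportions of explained variance so that they sum to 1. The sketch below shows that arithmetic in plain Python; the function name and the proportions are made up for illustration.

```python
def variance_weights(proportion_variance):
    """Normalize proportions of explained variance so the weights sum to 1."""
    total = sum(proportion_variance)
    return [p / total for p in proportion_variance]

# Made-up proportions of variance explained by three retained factors.
proportion = [0.30, 0.20, 0.10]
weights = variance_weights(proportion)
print([round(w, 3) for w in weights])
```

A factor that explains more of the variance therefore counts for more in the final index.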

# Conducting factor analysis with rotation

fa = FactorAnalyzer(n_factors=n_factor, rotation='varimax')

fa.fit(df_scaled)

# Loading factors

pd.DataFrame(data=fa.loadings_, index=df_scaled.columns, columns=facs).style.apply(highlightLoadings)

The rotation will help clarify which indicators dominate each factor, allowing for a more accurate interpretation.

Finally, we will generate the composite index by aggregating the weighted factor scores and assigning ranks to each province based on their education index.

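# Aggregating factor scores

# Assumption: df_factors_scaled is not defined in the original snippet; here it is taken to be the min-max scaled factor scores of the fitted model

df_factors = pd.DataFrame(fa.transform(df_scaled), columns=facs)

df_factors_scaled = pd.DataFrame(MinMaxScaler().fit_transform(df_factors), columns=facs)

dict_index = {}

for i in range(n_factor):

    key = df_factors_scaled.columns[i]

    value = df_factors_scaled.iloc[:, i].values * df_ratio_var.iloc[:, i].values

    dict_index.update({key: value})

df_index = pd.DataFrame(dict_index, index=pd.MultiIndex.from_frame(df[['Province', 'Region']]))

df_index['Composite Index'] = df_index.sum(axis=1).values

df_index['Rank'] = df_index['Composite Index'].rank(ascending=False)

df_index = df_index.sort_values(by='Rank').reset_index()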
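The aggregation and ranking logic can be sketched without pandas: each unit's composite index is the weighted sum of its scaled factor scores, and rank 1 goes to the highest index. The names, scores, and weights below are hypothetical, and unlike the pandas `rank` default this simple version does not average tied ranks.

```python
def composite_index(scores, weights):
    """Weighted sum of factor scores for each row."""
    return [sum(s * w for s, w in zip(row, weights)) for row in scores]

def rank_descending(values):
    """Rank 1 for the largest value; ties are broken by position, not averaged."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

# Made-up scaled factor scores for three units on two factors.
names = ["A", "B", "C"]
scores = [
    [1.0, 0.2],
    [0.5, 0.9],
    [0.0, 0.4],
]
weights = [0.6, 0.4]  # made-up weights that sum to 1

index = composite_index(scores, weights)
ranks = rank_descending(index)
for name, value, rank in zip(names, index, ranks):
    print(name, round(value, 2), rank)
```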

This education index can then be visualized using various charts to provide insights into the educational landscape across provinces.

In conclusion, constructing a composite indicator involves several steps: variable selection, multivariate analysis, data normalization, weighting, and aggregation. Each step utilizes different methods, and the choice of methodology can significantly affect the final output. This composite index serves as a valuable tool for policymakers aiming to enhance educational outcomes.

References

[1] [BPS] Badan Pusat Statistik. Potret Pendidikan Indonesia: Statistik Pendidikan 2020 (2020), Jakarta (ID): Badan Pusat Statistik.

[2] [OECD] Organisation for Economic Co-operation and Development. Handbook on Constructing Composite Indicators: Methodology and User Guide (2008), OECD.
