An Individual Exploration of Classroom Dynamics and Professional Identities
#Making the names anonymous
clase <- clase %>%
  mutate(Student_ID = paste0("Student_", row_number()))
In the era of Big Data, understanding the “Feature Space” of a specific group allows us to identify hidden patterns that define its identity. This report analyzes the 2026 Data Visualization group, which is made up of 22 students from the Double Degree in Informatics + CDIA and from CDIA alone.
The motivation behind this study is:
The dataset was provided in a transposed format, where students were represented as columns and variables as rows. In data science, for a “Feature Space” to be correctly interpreted by algorithms, each observation (student) must be a row and each feature (question) a column.
Using ‘tidyr’, I performed a transformation to pivot the data, ensuring that:
This step was fundamental: without this transformation, distance metrics would calculate similarities between questions rather than between individuals.
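The reshaping described above can be sketched on a tiny table. This is a minimal illustration, not the report's actual data: `raw_toy`, its two questions, and the scores are hypothetical, and only the `pivot_longer()` / `pivot_wider()` pattern mirrors the transformation used in the report.

```r
library(dplyr)   # provides tibble() and the %>% pipe
library(tidyr)   # provides pivot_longer() and pivot_wider()

# Hypothetical toy data: one row per question, one column per student
raw_toy <- tibble(
  Question  = c("I like maths", "I like programming"),
  Student_1 = c(5, 3),
  Student_2 = c(2, 4)
)

# Step 1: stack the student columns into (Question, Student_ID, Value) rows
# Step 2: spread the questions back out so each question becomes a column
tidy_toy <- raw_toy %>%
  pivot_longer(cols = -Question, names_to = "Student_ID", values_to = "Value") %>%
  pivot_wider(names_from = Question, values_from = Value)

print(tidy_toy)
```

After the two pivots, each student is a row and each question is a column, which is the layout `dist()` and `prcomp()` expect.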
Following best practices for data quality, I performed an integrity check. As shown, the dataset is completely populated (0 NAs).
#Checking for missing values across the entire dataset.
missing_per_col <- colSums(is.na(clase))
kable(missing_per_col, col.names=c("Missing Values"),
      caption="Verification of Data Completeness per Variable")
| Variable | Missing Values |
|---|---|
| Student_ID | 0 |
| I like the degree I’m studing | 0 |
| I like programming | 0 |
| I like Computer Games | 0 |
| I like Data Science | 0 |
| I like maths | 0 |
| I cannot live without knowing more about Data Visualization | 0 |
| My preferred music style is | 0 |
| I’m studing | 0 |
| I expect to work in Industry | 0 |
| I expect to work in research | 0 |
| I expect to work in education | 0 |
| I like watching sports | 0 |
| I like practicing sports | 0 |
| I like playing music | 0 |
| I like acting | 0 |
| I like painting | 0 |
| I like a different artistic expression (not listed here) | 0 |
| I like watching reels | 0 |
| I like listening to music | 0 |
| I like Alangua | 0 |
| I like vlogging | 0 |
| I like driving | 0 |
| I like messi | 0 |
| I like lying | 0 |
To understand the structure of our class, we must define how “close” two students are based on their profiles. I have chosen two fundamental metrics: the Euclidean distance and the Manhattan distance.
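For reference, the two metrics computed by the `dist()` calls later in this report (Euclidean and Manhattan) can be written as follows. The notation is introduced here only for illustration: x and y are the answer vectors of two students, and p is the number of numeric questions.

```latex
d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{j=1}^{p} \lvert x_j - y_j \rvert
```

The Euclidean distance squares each gap, so a few large disagreements weigh more than many small ones; the Manhattan distance adds the gaps as they are.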
Before calculating distances, it is important to consider scaling the data. Since all our variables are on the same Likert scale (1-5), the raw differences are already comparable; standardizing each variable would additionally ensure that no single variable dominates the distance calculation because of its larger variance.
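The effect of scaling can be seen in a small sketch. The data frame `toy`, its two questions, and the three students are hypothetical; the example only assumes base R's `scale()` and `dist()`.

```r
# Hypothetical toy answers for three students on two questions (1-5 scale)
toy <- data.frame(
  maths  = c(1, 5, 3),   # answers spread over the whole range
  sports = c(3, 3, 4),   # answers almost constant
  row.names = c("Student_A", "Student_B", "Student_C")
)

# Euclidean distances on the raw scores
d_raw <- dist(toy, method = "euclidean")

# Euclidean distances after standardizing each column (mean 0, sd 1)
d_scaled <- dist(scale(toy), method = "euclidean")

print(round(as.matrix(d_raw), 2))
print(round(as.matrix(d_scaled), 2))
```

On the raw scores, the distance between Student_A and Student_B is 4 and comes entirely from `maths`. After standardizing, the almost constant `sports` column carries as much weight as `maths`, and all three pairs end up at the same distance of 2.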
#Selecting only numeric features for distance calculation
dist_data <- clase %>%
  select(where(is.numeric)) %>%
  as.data.frame() #Plain data frame, since tibbles do not support row names
rownames(dist_data) <- clase$Student_ID #Displaying the names
#Calculating Euclidean and Manhattan distances
dist_euclidean <- dist(dist_data, method="euclidean")
dist_manhattan <- dist(dist_data, method="manhattan")
#Visualizing the Euclidean distance matrix with a Heatmap
pheatmap(as.matrix(dist_euclidean),
cluster_rows = FALSE,
cluster_cols = FALSE,
clustering_distance_rows = dist_euclidean,
clustering_distance_cols = dist_euclidean,
color= colorRampPalette(c("#27AE60", "white", "#E74C3C"))(50),
main= "Euclidean Distance Heatmap: Student Proximity",
display_numbers = FALSE)
The heatmap above represents the Euclidean Distance between every pair of students in our 2026 class.
How to read this space:
The Identity Diagonal: The perfect green line crossing from top-left to bottom-right represents the distance of each student to themselves (d=0).
Green Zones (Proximity): These areas identify “data twins”: students who share almost identical interests and professional goals.
Red Zones (Distance): These represent students with opposing profiles.
Key Insights:
The “Student 4” Anomaly: If we look at his row and column, we see a predominantly red pattern. Mathematically, he acts as an outlier in our classroom; his interests or expectations are significantly different from those of the majority of the group, creating high Euclidean distances.
Student 3 & Student 22’s Symmetry: We can distinguish a clear green intersection between Student 3 and Student 22. This indicates they are “neighbors” in our feature space, likely sharing high scores in the same categories.
Strong Divergences: The intersection between Student 18 and Student 19 shows one of the most intense red squares in the matrix. This suggests they are “polar opposites” within the class. What one values highly, the other likely disregards.
General Class Cohesion: Most of the matrix fluctuates between light red and white (distances of 6 to 8). This shows that while we are not a homogeneous group, there is a common baseline of interests that keeps the class from being entirely opposite.
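The pairs read off the heatmap by eye can also be listed programmatically. The sketch below is an illustration on hypothetical data (`toy` and its three questions are made up); in the report, the same steps could be applied to the Euclidean distance object.

```r
# Hypothetical toy scores; in the report this would be the numeric survey table
toy <- data.frame(
  q1 = c(5, 5, 1, 3),
  q2 = c(4, 5, 1, 3),
  q3 = c(1, 1, 5, 2),
  row.names = paste0("Student_", 1:4)
)

d <- as.matrix(dist(toy, method = "euclidean"))

# Keep each pair once and drop the zero diagonal
d[upper.tri(d, diag = TRUE)] <- NA

# Turn the matrix into a long table of (student_a, student_b, distance)
pairs <- as.data.frame(as.table(d), stringsAsFactors = FALSE)
names(pairs) <- c("student_a", "student_b", "distance")
pairs <- pairs[!is.na(pairs$distance), ]
pairs <- pairs[order(pairs$distance), ]

closest  <- head(pairs, 1)   # candidate "data twins"
farthest <- tail(pairs, 1)   # candidate "polar opposites"
print(closest)
print(farthest)
```

Sorting the full table, rather than keeping only the two extremes, also makes it easy to check whether a visually striking pair is really the most distant one or just one of several.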
While the Euclidean distance matrix provides a summarized metric of similarity, it obscures the specific variables that drive those distances. To deconstruct the feature space, I have implemented Parallel Coordinates Plots. This technique allows for the visualization of high-dimensional data by drawing each student as a line across one axis per variable.
#| echo: true
#| message: false
#| warning: false
#| fig-width: 12
#| fig-height: 6
parallel_df <- clase %>%
select(Student_ID, where(is.numeric)) %>%
rename_with(~str_remove_all(., "I like |I expect to work in |I cannot live without knowing more about "), #Shortening the labels to make the graph clearer
.cols=everything())
#Filtering the specific pair identified in the heatmap (Student_3 & Student_22)
pair_similar <- parallel_df %>%
filter(Student_ID %in% c("Student_3", "Student_22"))
ggparcoord(pair_similar,
columns = 2:ncol(pair_similar),
groupColumn = 1,
showPoints = TRUE,
title= "Parallel Coordinates: Visualizing High Proximity",
alphaLines = 0.8,
scale="globalminmax")+
theme_minimal()+
scale_color_manual(values=c("Student_3"="darkgreen", "Student_22"="lightgreen"))+
theme(axis.text.x = element_text(angle=45, hjust=1))+
labs(color="Student", y="Score (1-5)", x="Variable")
#Filtering the opposite pair
pair_different <- parallel_df %>%
filter(Student_ID %in% c("Student_18", "Student_19"))
ggparcoord(pair_different,
columns = 2:ncol(pair_different),
groupColumn = 1,
showPoints = TRUE,
title = "Parallel Coordinates: Visualizing Low Proximity (Student 18 & Student 19)",
alphaLines = 0.8,
scale = "globalminmax") +
theme_minimal() +
scale_color_manual(values = c("Student_18" = "darkred", "Student_19" = "red")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Features", y = "Score (1-5)", color = "Student")
The first plot illustrates the profiles of Student 3 and Student 22, who exhibited one of the lowest distances in the cohort.
Observation of synchronicity: The two lines move in almost perfect harmony. Notice the convergence in “Computer Games” (0), “Education” (0), and “Alangua” (0). Even in lifestyle variables like “Watching Sports” or “Practicing Sports,” they share identical scores.
Minor Divergence: The only slight gaps appear in a few variables, yet the overall trend remains identical.
Conclusion: This visual overlap confirms that their proximity in the heatmap is due to a shared identity. They don’t just have similar averages; they have similar patterns of interest.
The second plot compares Student 18 and Student 19, whose profiles represent a high degree of mathematical divergence.
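Observation of “Crossings”: Unlike the previous plot, these lines are constantly crossing each other. The crossings show that the two profiles rise and fall in opposite directions, and the wide vertical gaps between the lines are what produce the high Euclidean distance.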
Key Clashes:
Technical vs. Artistic: Student 18 peaks at 5 in “Painting” while Student 19 drops to 0. Conversely, Student 19 peaks at 4 in “Programming” and “Maths” while Student 18 drops to 1.
Social & Media: There is a massive gap in “Alangua” and “Vlogging,” where Student 19 hits 5 and Student 18 stays at 1.
The “Lying” & “Messi” Factors: Even in the more informal variables, Student 18 scores a 5 in “Lying” and a 4 in “Messi,” while Student 19 scores significantly lower.
Conclusion: This plot explains the “Intense Red” seen in the distance matrix. Their profiles are not just different; they are almost inversely correlated in several key areas. Where one finds passion, the other finds indifference.
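The difference between "far apart" and "inversely correlated" can be checked numerically. The three profiles below are hypothetical and only illustrate the idea; they are not the answers of any student in the class.

```r
# Hypothetical toy profiles on five questions (1-5 scale)
student_a <- c(5, 1, 5, 1, 4)
student_b <- c(1, 5, 1, 4, 2)   # roughly mirrors student_a
student_c <- c(4, 1, 5, 2, 4)   # roughly follows student_a

# Pearson correlation: do the two lines rise and fall together?
cor_ab <- cor(student_a, student_b)
cor_ac <- cor(student_a, student_c)

# Euclidean distance: how large are the gaps between the two lines?
dist_ab <- sqrt(sum((student_a - student_b)^2))
dist_ac <- sqrt(sum((student_a - student_c)^2))

print(c(cor_ab = cor_ab, cor_ac = cor_ac))
print(c(dist_ab = dist_ab, dist_ac = dist_ac))
```

In this toy case the mirrored pair has a negative correlation and the larger distance, while the pair that moves together has a positive correlation and the smaller distance. Applying `cor()` to the rows of the two students being compared would be one way to back the "inversely correlated" reading with a number.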
To conclude the mapping of our classroom, I have applied Principal Component Analysis (PCA). This technique allows us to project our 22-dimensional feature space onto a 2D plane, identifying the latent variables that explain the highest variance in our interests and expectations.
The following Biplot displays both the students as points and the original variables as vectors. The direction and length of the vectors indicate how much each variable contributes to the two main dimensions.
#Preparing the data for the PCA
pca_data <- clase %>%
select(Student_ID, where(is.numeric)) %>%
column_to_rownames("Student_ID")
#Executing the PCA
res.pca <- prcomp(pca_data, scale. = TRUE) #scale. is the full argument name in prcomp()
#Biplot
fviz_pca_biplot(res.pca,
repel=TRUE,
col.var="#2E86C1",
col.ind="#D35400",
label="all",
title="PCA Biplot: The 2026 Analytics Feature Space",
ggtheme=theme_minimal())+
labs(x="Dimension 1 (Variance Explained)",
y="Dimension 2 (Variance Explained)")
The PCA Biplot provides a definitive map of our Feature Space. By observing the orientation of the blue vectors and the distribution of students, we can decode the underlying structure of the 2026 Analytics cohort.
Decoding the Axes:
Dimension 1 (Horizontal - Technical vs Industry): This axis explains the largest portion of our class variance. To the right, we see a strong concentration of technical and academic variables: I like maths, I like programming and I expect to work in research. To the left, the space is dominated by I like lying and I expect to work in Industry.
Dimension 2 (Vertical - Lifestyle & Art): This axis separates students based on their extracurricular vibe. The top is defined by I like practicing sports, while the bottom is heavily pulled by I like painting, I like acting and I like Computer Games.
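The reading of the axes above relies on two quantities that `prcomp()` exposes directly: the variance explained by each component and the loadings of each variable. The sketch below shows how to inspect them; the data frame `toy` is randomly generated and hypothetical, so its numbers say nothing about the class.

```r
set.seed(1)   # hypothetical toy data, not the class survey
toy <- data.frame(
  maths       = sample(1:5, 20, replace = TRUE),
  programming = sample(1:5, 20, replace = TRUE),
  painting    = sample(1:5, 20, replace = TRUE),
  sports      = sample(1:5, 20, replace = TRUE)
)

pca <- prcomp(toy, scale. = TRUE)

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: weight of each original variable on the first two components
print(round(pca$rotation[, 1:2], 3))

# Variable with the largest absolute weight on the first component
top_var <- names(which.max(abs(pca$rotation[, "PC1"])))
print(top_var)
```

Checking the loadings table alongside the biplot helps confirm which variables really define each dimension, since overlapping arrows and labels can be hard to read.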
Student Clusters and Identities
Based on their coordinates, we can identify four distinct quadrants:
The Statistical Outliers
The PCA highlights two students who are “unique” in their high-dimensional fingerprint:
In the global landscape of the 2026 Analytics cohort, I am located in the Bottom Right quadrant. This position is not random; it defines a very specific profile within our “Feature Space”:
Main Drivers: My location is primarily dictated by the long vectors of I like the degree I’m studying and I like acting. This suggests that my profile is characterized by a high degree of academic satisfaction combined with a strong creative/artistic inclination.
The Technical-Creative Balance: Being on the right side of the map, I align with the technical core of the class (Maths and Programming). However, my vertical position (towards the bottom) separates me from the “pure” academic cluster (like Student 7 or Student 19), showing that I integrate hobbies like Computer Games and Acting as a core part of my identity.
Comparison to the Average: While students like Student 17 or Student 20 represent the class average near the center, I am an “Edge Case”. This means my profile is more specialized and defined than the median.
Peers & Neighbors: My closest neighbor in this high-dimensional space is Student 15. We share a similar “Creative-Technical” fingerprint, distancing ourselves from the more industry-only focused profiles located on the opposite side of the map (like Student 18).
The high-dimensional analysis of our classroom reveals that we are far from a homogeneous group. While we share a common academic path, our “Feature Space” is naturally fragmented into different archetypes:
Cohesion vs. Individuality: The distance matrix showed that while most of the class shares a “core” of interests, certain individuals (outliers) provide the necessary diversity for a rich learning environment.
Validation of the Model: The consistency between the Euclidean distances, the Parallel Coordinates, and the PCA confirms that our responses are not random. There is a logic behind our interests: those who prefer the abstract (maths/research) tend to distance themselves from the pragmatic (industry/vlogging).
The Power of Viz: This exercise demonstrates that data visualization is not just about making “pretty charts,” but about uncovering the latent structures that define a human group.
To ensure the reproducibility of this report, the following R code was used to transform the raw classroom data. This process involved transposing the original matrix (students as columns) into a tidy format (students as rows/observations) and handling the data types for the Likert scales.
#Loading the packages used below (readr, tidyr, dplyr, stringr)
library(tidyverse)
#Loading the raw file
raw_data <- read_csv("ClassRoomAnalyticsForm2026.csv")
#Transposing the data (Pivoting)
clean_clase <- raw_data %>%
pivot_longer(cols = -1, names_to = "Student_ID", values_to = "Value") %>%
pivot_wider(names_from = 1, values_from = Value)
#Numeric conversion and cleaning
clean_clase <- clean_clase %>%
mutate(across(where(is.character), ~as.numeric(str_extract(., "\\d")))) %>%
mutate(Student_ID = colnames(raw_data)[-1])
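The numeric conversion step relies on `str_extract(., "\\d")`, which keeps the first digit found in each answer. The sketch below shows its behavior on a few made-up strings; `raw_answers` is hypothetical and the real form's labels may differ.

```r
library(stringr)

# Hypothetical raw answers; the real form's labels may differ
raw_answers <- c("5 - Strongly agree", "3", "1 (not at all)", "Rock")

# "\\d" matches the first single digit; text without any digit becomes NA
scores <- as.numeric(str_extract(raw_answers, "\\d"))
print(scores)   # 5 3 1 NA
```

Two consequences are worth keeping in mind: a purely textual answer (such as a music style) is turned into NA, and a two-digit answer such as "10" would be read as 1, so this approach fits single-digit Likert scales only.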