Visualizing ourselves

An Individual Exploration of Classroom Dynamics and Professional Identities

Author

Joel Martín - Student 11

Code
#Making the names anonymous
clase <- clase %>%
  mutate(Student_ID = paste0("Student_", row_number()))

1. Introduction: Context & Motivation

In the era of Big Data, understanding the “Feature Space” of a specific group allows us to identify hidden patterns that define its identity. This report analyzes the 2026 Data Visualization group, which is made up of 22 students drawn from the Double Degree in Informatics + CDIA and from the CDIA degree alone.

The motivation behind this study is twofold:

  1. Global Structure: To identify whether there is a unified analytics profile or whether interests (Maths, Arts, Programming…) create distinct clusters.
  2. Individual Positioning: To locate myself within this high-dimensional space. By applying distance metrics and dimensionality reduction, I will determine which traits make me similar or different from the class average.

2. Data Wrangling & Quality Assurance

2.1 Reshaping the Feature Space

The dataset was provided in a transposed format, where students were represented as columns and variables as rows. In data science, for a “Feature Space” to be correctly interpreted by algorithms, each observation (student) must be a row and each feature (question) a column.

Using ‘tidyr’, I pivoted the data, ensuring that:

  1. Each row represents a unique student
  2. Question strings are converted into clean variable names
  3. Scale responses (1-5) are correctly typed as numeric doubles

This step was fundamental: without this transformation, distance metrics would calculate similarities between questions rather than between individuals.
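
To make the point about orientation concrete, here is a minimal sketch in base R. The question names and scores are made up for illustration and are not the class data; it only shows that `dist()` compares rows, so the layout decides what gets compared.

```r
# Toy survey in the transposed layout: questions as rows, students as columns.
# All names and scores here are invented for illustration.
raw <- matrix(c(5, 4, 1,
                2, 5, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("I like maths", "I like painting"),
                              c("Student_1", "Student_2", "Student_3")))

# dist() compares ROWS, so on the raw layout it compares the two questions
d_questions <- dist(raw)

# After transposing, each row is a student, so dist() compares the students
d_students <- dist(t(raw))

attr(d_questions, "Size")    # number of questions compared
attr(d_students, "Size")     # number of students compared
attr(d_students, "Labels")   # row labels carried into the distance object
```

With the transposed layout the distance object has one entry per pair of questions; only after reshaping does it describe pairs of students.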

2.2 Missing Value Analysis

Following best practices for data quality, I performed an integrity check. As shown below, the dataset is completely populated (0 NAs).

Code
#Checking for missing values across the entire dataset.
missing_per_col <- colSums(is.na(clase))
kable(missing_per_col, col.names=c("Missing Values"),
      caption="Verification of Data Completeness per Variable")

Verification of Data Completeness per Variable

Variable                                                      Missing Values
Student_ID                                                    0
I like the degree I’m studing                                 0
I like programming                                            0
I like Computer Games                                         0
I like Data Science                                           0
I like maths                                                  0
I cannot live without knowing more about Data Visualization   0
My preferred music style is                                   0
I’m studing                                                   0
I expect to work in Industry                                  0
I expect to work in research                                  0
I expect to work in education                                 0
I like watching sports                                        0
I like practicing sports                                      0
I like playing music                                          0
I like acting                                                 0
I like painting                                               0
I like a different artistic expression (not listed here)      0
I like watching reels                                         0
I like listening to music                                     0
I like Alangua                                                0
I like vlogging                                               0
I like driving                                                0
I like messi                                                  0
I like lying                                                  0

3. Distance Metrics & Similarity Analysis

To understand the structure of our class, we must define how “close” two students are based on their profiles. I have chosen two fundamental metrics:

  1. Euclidean Distance: Measuring the straight-line distance between two students in our numeric feature space.
  2. Manhattan Distance: Measuring the sum of absolute differences, which can be more robust in high-dimensional survey data.
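
As a small worked example of the two metrics, the sketch below compares two hypothetical students on five made-up 1-5 answers (these are not real responses from the class) and checks the hand computation against base R's `dist()`.

```r
# Two hypothetical students answering five questions on a 1-5 scale
a <- c(5, 4, 1, 3, 2)
b <- c(3, 4, 4, 1, 2)

# Differences per question: 2, 0, -3, 2, 0
euclidean <- sqrt(sum((a - b)^2))   # sqrt(4 + 0 + 9 + 4 + 0) = sqrt(17)
manhattan <- sum(abs(a - b))        # 2 + 0 + 3 + 2 + 0 = 7

# The same values via dist(), which treats each row as one observation
m <- rbind(a, b)
dist(m, method = "euclidean")
dist(m, method = "manhattan")
```

The Euclidean value squares each gap, so the single 3-point disagreement weighs more than the two 2-point ones; the Manhattan value counts every point of disagreement equally.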

3.1 Computing the Distance Matrices

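Before calculating distances, it is worth asking whether the data should be scaled. Since all our variables are on the same Likert scale (1-5), the raw differences are directly comparable, so the distances below are computed on the unscaled responses. Standardizing each variable first would be the alternative if we wanted to prevent high-variance questions from dominating the distance calculation.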
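
To illustrate what standardizing does to a distance, here is a hedged sketch on invented data: `q1` and `q2` are hypothetical questions, one with widely spread answers and one where almost everyone agrees. It is not the computation used for the heatmap, only a comparison of `dist()` before and after `scale()`.

```r
# Made-up responses: q1 is spread out, q2 barely varies
toy <- data.frame(q1 = c(1, 5, 3, 5, 1),
                  q2 = c(3, 3, 4, 3, 3))
rownames(toy) <- paste0("Student_", 1:5)

d_raw    <- dist(toy)          # raw Likert scores
d_scaled <- dist(scale(toy))   # each column centered, unit standard deviation

# Column standard deviations before and after scaling
sapply(toy, sd)
apply(scale(toy), 2, sd)

# Pairwise distances under both choices
round(as.matrix(d_raw), 2)
round(as.matrix(d_scaled), 2)
```

After scaling, a one-point gap on the low-variance question counts for more than a one-point gap on the spread-out question, which is why the two matrices rank some pairs differently.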

Code
#Selecting only numeric features for distance calculation
#(converted to a plain data frame, since tibbles do not support row names)
dist_data <- clase %>% 
  select(where(is.numeric)) %>% 
  as.data.frame()

rownames(dist_data) <- clase$Student_ID #Labelling each row with its anonymized ID

#Calculating Euclidean and Manhattan distances
dist_euclidean <- dist(dist_data, method="euclidean")
dist_manhattan <- dist(dist_data, method="manhattan")

#Visualizing the Euclidean distance matrix with a Heatmap
#(clustering is switched off, so students keep their original order)
pheatmap(as.matrix(dist_euclidean),
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         color= colorRampPalette(c("#27AE60", "white", "#E74C3C"))(50),
         main= "Euclidean Distance Heatmap: Student Proximity",
         display_numbers = FALSE)

3.2 Interpreting the Distance Matrix

The heatmap above represents the Euclidean Distance between every pair of students in our 2026 class.

How to read this space:

  • The Identity Diagonal: The perfect green line crossing from top-left to bottom-right represents the distance of a student to themselves (d=0).

  • Green Zones (Proximity): These areas identify “data twins”: students who share almost identical interests and professional goals.

  • Red Zones (Distance): These represent students with opposing profiles.

Key Insights:

  • The “Student 4” Anomaly: If we look at his row and column, we see a predominantly red pattern. Mathematically, he acts as an outlier in our classroom; his interests or expectations are significantly different from the majority of the group, creating high Euclidean distances.

  • Student 3 & Student 22’s Symmetry: We can distinguish a notable green intersection between Student 3 and Student 22. This indicates they are “neighbors” in our feature space, likely sharing high scores in the same categories.

  • Strong Divergences: The intersection between Student 18 and Student 19 shows one of the most intense red squares in the matrix. This suggests they are “polar opposites” within the class. What one values highly, the other likely disregards.

  • General Class Cohesion: Most of the matrix fluctuates between light red and white (distances of 6 to 8). This shows that while we are not a homogeneous group, there is a common baseline of interests that keeps the class from being entirely opposite.
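
Readings like these can also be checked numerically instead of by eye. The sketch below uses simulated 1-5 answers (not the class data) to pull the closest pair, the farthest pair, and each student's mean distance to everyone else out of a distance matrix; `pair_names` is a hypothetical helper written for this example.

```r
set.seed(1)  # simulated 1-5 responses, not the real class data
toy <- matrix(sample(1:5, 6 * 8, replace = TRUE), nrow = 6,
              dimnames = list(paste0("Student_", 1:6), paste0("Q", 1:8)))

full <- as.matrix(dist(toy, method = "euclidean"))

# Keep each pair once: blank out the upper triangle and the zero diagonal
d <- full
d[upper.tri(d, diag = TRUE)] <- NA

closest  <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)
farthest <- which(d == max(d, na.rm = TRUE), arr.ind = TRUE)

# Hypothetical helper: turn matrix indices into student names
pair_names <- function(idx, m) {
  data.frame(a = rownames(m)[idx[, "row"]], b = colnames(m)[idx[, "col"]])
}
pair_names(closest, d)    # candidate "data twins"
pair_names(farthest, d)   # candidate "polar opposites"

# Mean distance to all other students: high values flag potential outliers
mean_dist <- rowSums(full) / (nrow(full) - 1)
sort(mean_dist, decreasing = TRUE)
```

Ties are possible with integer survey scores, which is why the helper returns a data frame that can hold more than one pair.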

4. Multivariate Analysis: Parallel Coordinates

While the Euclidean distance matrix provides a summarized metric of similarity, it obscures the specific variables that drive those distances. To deconstruct the feature space, I have implemented Parallel Coordinate Plots. This technique draws one vertical axis per variable and one line per student, so high-dimensional profiles can be compared variable by variable.

Code
#| echo: true
#| message: false
#| warning: false
#| fig-width: 12
#| fig-height: 6

parallel_df <- clase %>% 
  select(Student_ID, where(is.numeric)) %>% 
  rename_with(~str_remove_all(., "I like |I expect to work in |I cannot live without knowing more about "), #Shortening the axis labels to make the graph clearer
              .cols=everything())

#Filtering the specific pair identified in the heatmap (Student 3 & Student 22)
pair_similar <- parallel_df %>% 
  filter(Student_ID %in% c("Student_3", "Student_22"))

ggparcoord(pair_similar,
           columns = 2:ncol(pair_similar),
           groupColumn = 1,
           showPoints = TRUE,
           title= "Parallel Coordinates: Visualizing High Proximity",
           alphaLines = 0.8,
           scale="globalminmax")+
  theme_minimal()+
  scale_color_manual(values=c("Student_3"="darkgreen", "Student_22"="lightgreen"))+
  theme(axis.text.x = element_text(angle=45, hjust=1))+
  labs(color="Student", y="Score (1-5)", x="Variable")

Code
#Filtering the pair of opposites
pair_different <- parallel_df %>% 
  filter(Student_ID %in% c("Student_18", "Student_19"))

ggparcoord(pair_different,
           columns = 2:ncol(pair_different), 
           groupColumn = 1,
           showPoints = TRUE, 
           title = "Parallel Coordinates: Visualizing Low Proximity (Student 18 & Student 19)",
           alphaLines = 0.8,
           scale = "globalminmax") +
  theme_minimal() +
  scale_color_manual(values = c("Student_18" = "darkred", "Student_19" = "red")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Features", y = "Score (1-5)", color = "Student")

4.1 The “Data Twins”: Student 3 & Student 22

The first plot illustrates the profiles of Student 3 and Student 22, who exhibited one of the lowest distances in the cohort.

  • Observation of synchronicity: The two lines move in almost perfect harmony. Notice the convergence in “Computer Games” (0), “Education” (0), and “Alangua” (0). Even in lifestyle variables like “Watching Sports” or “Practicing Sports,” they share identical scores.

  • Minor Divergence: The only slight gaps appear in a few variables, yet the overall trend remains identical.

  • Conclusion: This visual overlapping confirms that their proximity in the heatmap is due to a shared identity. They don’t just have similar averages; they have similar patterns of interest.

4.2 The “Polar Opposites”: Student 18 & Student 19

The second plot compares Student 18 and Student 19, whose profiles represent a high degree of mathematical divergence.

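  • Observation of “Crossings”: Unlike the previous plot, these lines repeatedly cross each other and are separated by wide vertical gaps. The gaps are what produce the high Euclidean distance, while the crossings show that the two profiles move in opposite directions.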

  • Key Clashes:

    • Technical vs. Artistic: Student 18 peaks at 5 in “Painting” while Student 19 drops to 0. Conversely, Student 19 peaks at 4 in “Programming” and “Maths” while Student 18 drops to 1.

    • Social & Media: There is a massive gap in “Alangua” and “Vlogging,” where Student 19 hits 5 and Student 18 stays at 1.

    • The “Lying” & “Messi” Factors: Even in the more informal variables, Student 18 scores a 5 in “Lying” and a 4 in “Messi,” while Student 19 scores significantly lower.

  • Conclusion: This plot explains the “Intense Red” seen in the distance matrix. Their profiles are not just different; they are almost inversely correlated in several key areas. Where one finds passion, the other finds indifference.
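
The difference between "far apart" and "inversely correlated" can be sketched with three hypothetical profiles. The vectors below are invented for illustration and do not correspond to any student: `p2` mirrors `p1` on the 1-5 scale, while `p3` follows roughly the same shape as `p1`.

```r
# Hypothetical profiles over six questions (1-5 scale)
p1 <- c(5, 1, 4, 2, 5, 1)
p2 <- 6 - p1                # mirror image of p1: 1, 5, 2, 4, 1, 5
p3 <- c(4, 1, 5, 2, 4, 2)   # similar shape to p1

# Correlation captures the shape of the lines (crossing vs. moving together)
cor(p1, p2)   # exactly -1 for a mirrored profile
cor(p1, p3)   # strongly positive

# Euclidean distance captures the size of the gaps between the lines
sqrt(sum((p1 - p2)^2))
sqrt(sum((p1 - p3)^2))   # 2
```

In a parallel coordinates plot, the mirrored pair would cross between every pair of neighboring axes, while the similar pair would run almost in parallel.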

5. Dimensionality Reduction: PCA

To conclude the mapping of our classroom, I have applied Principal Component Analysis (PCA). This technique projects our 22-dimensional feature space onto a 2D plane, identifying the latent dimensions that explain the most variance in our interests and expectations.

5.1 The Classroom Biplot: Mapping Students and Variables

The following Biplot displays both the students as points and the original variables as vectors. The direction and length of the vectors indicate how much each variable contributes to the two main dimensions.

Code
#Preparing the data for the PCA
pca_data <- clase %>% 
  select(Student_ID, where(is.numeric)) %>% 
  column_to_rownames("Student_ID")

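#Executing the PCA on standardized variables
res.pca <- prcomp(pca_data, scale. = TRUE)

#Percentage of variance explained by each component, used in the axis labels
var_pct <- round(100 * res.pca$sdev^2 / sum(res.pca$sdev^2), 1)

#Biplot
fviz_pca_biplot(res.pca,
                repel=TRUE,
                col.var="#2E86C1",
                col.ind="#D35400",
                label="all",
                title="PCA Biplot: The 2026 Analytics Feature Space",
                ggtheme=theme_minimal())+
  labs(x=paste0("Dimension 1 (", var_pct[1], "% of variance)"),
       y=paste0("Dimension 2 (", var_pct[2], "% of variance)"))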

5.2 Interpreting the Biplot: Archetypes and Classroom Dynamics

The PCA Biplot provides a definitive map of our Feature Space. By observing the orientation of the blue vectors and the distribution of students, we can decode the underlying structure of the 2026 Analytics cohort.

  1. Decoding the Axes:

    • Dimension 1 (Horizontal - Technical vs Industry): This axis explains the largest portion of our class variance. To the right, we see a strong concentration of technical and academic variables: I like maths, I like programming and I expect to work in research. To the left, the space is dominated by I like lying and I expect to work in Industry.

    • Dimension 2 (Vertical - Lifestyle & Art): This axis separates students based on their extracurricular vibe. The top is defined by I like practicing sports, while the bottom is heavily pulled by I like painting, I like acting and I like Computer Games.

  2. Student Clusters and Identities

    Based on their coordinates, we can identify four distinct groups:

    • The Technical-Academic (Top Right): Students like Student 2, Student 19, and Student 17 are located here. They show a high correlation with mathematical interest and research-oriented career paths.
    • The Industry-Pragmatic (Top Left): This is where Student 3 and Student 22 reside. Their proximity in this map (close to the Industry vector) reaffirms our “Data Twins” theory from the previous section. They are balanced but lean towards the corporate application of data.
    • The Creative-Gamer (Bottom Right): Students like Student 11 and Student 15 are influenced by the vectors of Acting, Computer Games and Liking the degree.
    • The Class “Average” (Origin): Near the center, we find Student 13, Student 14 and Student 6. These students represent the core identity of the class, showing balanced interests without extreme polarization.
  3. The Statistical Outliers

    The PCA highlights two students who are “unique” in their high-dimensional fingerprint:

    • Student 18: Located at the extreme left of the chart. His position is almost entirely driven by his high scores in Industry and Lying, distancing him from the technical-research cluster.
    • Student 4: Located at the absolute bottom of the map. He is the most “Artistic/Gamer” profile in the class, pulled down by the Painting, Acting and Computer Games vectors, showing a profile that is mathematically very different from the rest.
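
Axis readings and outlier calls like the ones above can be cross-checked against the numbers stored in a `prcomp` object. The following is a minimal sketch on simulated data, assuming two hypothetical latent traits (`tech` and `art`) that are not part of the survey; it shows where the variance shares, loadings, and student coordinates live.

```r
set.seed(42)  # simulated data, not the class survey
n <- 22
tech <- rnorm(n)   # hypothetical latent "technical" trait
art  <- rnorm(n)   # hypothetical latent "artistic" trait
toy <- data.frame(
  maths       = tech + rnorm(n, sd = 0.3),
  programming = tech + rnorm(n, sd = 0.3),
  painting    = art  + rnorm(n, sd = 0.3),
  acting      = art  + rnorm(n, sd = 0.3)
)
rownames(toy) <- paste0("Student_", seq_len(n))

pca <- prcomp(toy, scale. = TRUE)

# Share of variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: how strongly each variable points along PC1 and PC2
round(pca$rotation[, 1:2], 2)

# Scores: student coordinates on the biplot; extreme values flag outliers
scores <- pca$x[, 1:2]
head(scores[order(-abs(scores[, "PC1"])), ], 3)
```

Variables with large loadings of the same sign on a component are the ones that define that end of the axis, and the students with the most extreme scores are the ones that appear at the edges of the biplot.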

5.3 Where am I (Joel/Student 11) located?

In the global landscape of the 2026 Analytics cohort, I am located in the Bottom Right quadrant. This position is not random; it defines a very specific profile within our “Feature Space”:

  • Main Drivers: My location is primarily dictated by the long vectors of I like the degree I’m studying and I like acting. This suggests that my profile is characterized by a high degree of academic satisfaction combined with a strong creative/artistic inclination.

  • The Technical-Creative Balance: Being on the right side of the map, I align with the technical core of the class (Maths and Programming). However, my vertical position (towards the bottom) separates me from the “pure” academic cluster (like Student 7 or Student 19), showing that I integrate hobbies like Computer Games and Acting as a core part of my identity.

  • Comparison to the Average: While students like Student 17 or Student 20 represent the class average near the center, I am an Edge Case. This means my profile is more specialized and defined than the median.

  • Peers & Neighbors: My closest neighbor in this high-dimensional space is Student 15. We share a similar “Creative-Technical” fingerprint, distancing ourselves from the more industry-only focused profiles located on the opposite side of the map (like Student 18).
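
A nearest-neighbor claim can be verified both in the full feature space and on the 2D PCA map, and the two answers need not agree because the map discards the remaining components. The sketch below uses simulated answers (not the class data); `nearest` is a hypothetical helper and `Student_3` is an arbitrary student of interest.

```r
set.seed(7)  # simulated responses, not the real class data
toy <- matrix(sample(1:5, 10 * 6, replace = TRUE), nrow = 10,
              dimnames = list(paste0("Student_", 1:10), paste0("Q", 1:6)))

# Hypothetical helper: name of the row closest to `id` (Euclidean distance)
nearest <- function(m, id) {
  d <- as.matrix(dist(m))[id, ]
  d <- d[names(d) != id]          # drop the zero distance to oneself
  names(d)[which.min(d)]
}

me <- "Student_3"                  # arbitrary student of interest

nn_full <- nearest(scale(toy), me)                           # all standardized features
nn_2d   <- nearest(prcomp(toy, scale. = TRUE)$x[, 1:2], me)  # first two components only

c(full_space = nn_full, pca_plane = nn_2d)
```

If the two names differ, the neighbor seen on the biplot is close only along the first two components, so the full-space result is the safer one to report.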

6. Final Conclusions: The Identity of the 2026 Cohort

The high-dimensional analysis of our classroom reveals that we are far from a homogeneous group. While we share a common academic path, our “Feature Space” is naturally fragmented into different archetypes:

  • Cohesion vs. Individuality: The distance matrix showed that while most of the class shares a “core” of interests, certain individuals (outliers) provide the necessary diversity for a rich learning environment.

  • Validation of the Model: The consistency between the Euclidean distances, the Parallel Coordinates, and the PCA suggests that our responses are structured rather than random. There is a logic behind our interests: those who prefer the abstract (maths/research) tend to distance themselves from the pragmatic (industry/vlogging).

  • The Power of Viz: This exercise demonstrates that data visualization is not just about making “pretty charts,” but about uncovering the latent structures that define a human group.

Annex: Data Cleaning & Wrangling

To ensure the reproducibility of this report, the following R code was used to transform the raw classroom data. This process involved transposing the original matrix (students as columns) into a tidy format (students as rows/observations) and handling the data types for the Likert scales.

Code
#Loading the packages used for wrangling (readr, tidyr, dplyr, stringr)
library(tidyverse)

#Loading the raw file
raw_data <- read_csv("ClassRoomAnalyticsForm2026.csv")

#Transposing the data (Pivoting): students become rows, questions become columns
clean_clase <- raw_data %>%
  pivot_longer(cols = -1, names_to = "Student_ID", values_to = "Value") %>%
  pivot_wider(names_from = 1, values_from = Value)

#Numeric conversion and cleaning: keep the first digit of each Likert answer,
#leaving Student_ID untouched
clean_clase <- clean_clase %>%
  mutate(across(where(is.character) & !Student_ID, ~as.numeric(str_extract(., "\\d"))))