Artificial Intelligence

This page presents some of the work and experiments that I did on artificial intelligence by applying popular machine-learning methods and Python libraries such as TensorFlow and OpenCV.

Large Language Models

Large language models (LLMs) are deep learning models that can perform a variety of natural language processing tasks. Popular LLMs include Bard, ChatGPT, and Falcon.

In the video below, I show a voice-controlled, Python-powered AI desktop assistant similar to the fictional J.A.R.V.I.S., which I created by integrating Google’s Bard with several other modules and libraries. Since, at the time of writing (Sep 22nd, 2023), Google had not yet released an official API, in this project I used the unofficial Bard-API developed by Minwoo (Daniel) Park.

Other Python modules and libraries that I used are DeepFace (face detection and recognition), Mediapipe (background removal), Vosk (speech-to-text conversion), Pyttsx3 (text-to-speech conversion), Pygame (mixer module for playing music), Pywhatkit (sending WhatsApp messages), and webbrowser (browser interfacing).
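
As a rough illustration of how some of these pieces fit together, the sketch below shows the core question-answer step: query Bard through the unofficial API and speak the answer with Pyttsx3. The Vosk speech-recognition part is omitted for brevity, and the _BARD_API_KEY environment variable is assumed to hold a valid Bard session token.

```python
# Minimal sketch of the assistant's question-answer step (speech recognition with Vosk omitted).
# Assumes the unofficial bardapi package and a valid Bard session token in _BARD_API_KEY.
import os

import pyttsx3
from bardapi import Bard


def ask_bard(question: str) -> str:
    # Query Google's Bard through the unofficial API and return the text of the answer
    bard = Bard(token=os.environ["_BARD_API_KEY"])
    return bard.get_answer(question)["content"]


def speak(text: str) -> None:
    # Convert the answer to speech with Pyttsx3
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


if __name__ == "__main__":
    answer = ask_bard("What is the weather like in Rome today?")
    print(answer)
    speak(answer)
```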

The video is in Italian with English subtitles.

Computer Vision

Computer vision is a field of artificial intelligence that enables computers to derive meaningful information from digital images, videos and other visual inputs, and then to take actions or make recommendations based on that information. Some of the most important applications are listed below.

Facial Image Processing

Faces are among the most important classes of objects computers have to deal with. For this reason, automatic processing of facial images has attracted considerable attention in the last decades.

The following video shows some examples of facial image processing, such as face detection and recognition, facial landmark detection, and facial attribute analysis, which I performed through OpenCV, DeepFace, and other computer vision libraries for Python.
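
As a minimal sketch of this kind of analysis (the image file names are placeholders), DeepFace can verify whether two photos show the same person and estimate facial attributes with a couple of calls:

```python
# Minimal sketch: face verification and facial attribute analysis with DeepFace.
# "person1.jpg" and "person2.jpg" are placeholder file names.
from deepface import DeepFace

# Face verification: do the two images show the same person?
result = DeepFace.verify(img1_path="person1.jpg", img2_path="person2.jpg")
print("Same person:", result["verified"])

# Facial attribute analysis: age, gender, emotion and ethnicity
analysis = DeepFace.analyze(img_path="person1.jpg",
                            actions=["age", "gender", "emotion", "race"])
print(analysis)
```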

Object Classification

Convolutional neural networks (CNNs) are biologically-inspired, hierarchical networks most commonly applied to analyze visual imagery. Typical applications include image and video recognition, image segmentation and classification, medical image analysis, etc. CNNs mimic the hierarchical architecture of the ventral visual stream in the brain. This pathway represents an occipitotemporal network linking early visual areas and the anterior inferior temporal cortex along multiple routes through which visual information is processed.

The human visual system
Hubel and Wiesel discovered two major cell types in the primary visual cortex (V1). Simple cells, the first type (green, on the left in the picture below), have preferred locations in the image (dashed ovals) where they respond most strongly to bars of light or dark with a particular orientation.
Comparison between the architecture of V1 and that of CNNs

Each cell has an orientation of the bar at which it fires most, with its response falling off as the angle of the bar changes from this preferred orientation, thereby creating an orientation ‘tuning curve’.

The second type, complex cells (orange), receive input from many simple cells, all with the same preferred orientation but with slightly different preferred locations. These operations are replicated in a convolutional neural network (right in the picture above). With several iterations of the stacked bundle of convolution-nonlinearity-pooling, this creates a hierarchical model that mimics not just the operations of V1 but the ventral visual pathway as a whole.

I applied CNNs to the patch_camelyon dataset (see this link), which consists of 327,680 color images (96x96px) extracted from histopathologic scans of lymph node sections. The purpose of this experimentation was to determine whether a scan shown to the CNN contained metastatic tissue.

An example of CNN used to detect metastatic tissue in patch_camelyon scans

After experimenting with different numbers of layers, learning rates, etc., I achieved an accuracy of approximately 80% on the test dataset.
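
A minimal TensorFlow/Keras sketch of this kind of binary classifier is shown below. The architecture and hyperparameters are illustrative and differ from the ones I actually experimented with:

```python
# Minimal sketch of a CNN for the patch_camelyon dataset (TensorFlow/Keras).
# The architecture and hyperparameters are illustrative, not the exact ones discussed above.
import tensorflow as tf
import tensorflow_datasets as tfds


def preprocess(example):
    # Scale pixel values to [0, 1] and return (image, binary label)
    image = tf.cast(example["image"], tf.float32) / 255.0
    return image, example["label"]


train_ds = tfds.load("patch_camelyon", split="train").map(preprocess).batch(64)
test_ds = tfds.load("patch_camelyon", split="test").map(preprocess).batch(64)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(96, 96, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # metastatic tissue: yes/no
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10, validation_data=test_ds)
model.evaluate(test_ds)
```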

Object Detection

Object detection is the task of detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Object detection is useful for a variety of tasks, including vehicle counting, face detection, object tracking, and more.

The following video shows some examples of object detection that I performed through OpenCV by importing the weights of a pre-trained YOLOv4 neural network.
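
A minimal sketch of this pipeline is shown below, assuming the standard Darknet configuration and weight files (yolov4.cfg, yolov4.weights) and the COCO class list; the input image name is a placeholder:

```python
# Minimal sketch: object detection with a pre-trained YOLOv4 network in OpenCV.
# "yolov4.cfg"/"yolov4.weights" are the Darknet files, "coco.names" the class labels,
# and "street.jpg" a placeholder input image.
import cv2

net = cv2.dnn.readNet("yolov4.weights", "yolov4.cfg")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

with open("coco.names") as f:
    classes = [line.strip() for line in f]

frame = cv2.imread("street.jpg")
class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)

# Draw a labelled bounding box for each detection
for class_id, score, box in zip(class_ids, scores, boxes):
    x, y, w, h = box
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.putText(frame, f"{classes[int(class_id)]}: {float(score):.2f}",
                (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)

cv2.imwrite("detections.jpg", frame)
```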

In the video below, I trained a YOLOv5 network on Google Colab using custom data that I collected on the internet. The task of the network was to detect two categories in video frames: TIGERS and LIONS. Even though the network was trained on only a few tens of images, and despite the significant similarities between the two categories, it achieved good performance after 80-100 epochs.
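
As a rough sketch of the workflow (file names such as tigers_lions.yaml, best.pt, and the test image are placeholders), the network is trained with the YOLOv5 repository's train.py script on Colab, and the resulting weights can then be loaded from Python via torch.hub:

```python
# Sketch: loading the custom-trained YOLOv5 weights from Python for inference.
# The training itself was run in a Colab cell with the YOLOv5 repository's script, roughly:
#   python train.py --img 640 --batch 16 --epochs 100 --data tigers_lions.yaml --weights yolov5s.pt
# "tigers_lions.yaml", "best.pt" and "tiger_and_lion.jpg" are placeholder names.
import torch

model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
results = model("tiger_and_lion.jpg")
results.print()  # class (TIGER/LION), confidence and bounding box for each detection
results.save()   # save the annotated image
```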

Semantic Segmentation

Semantic segmentation is the task of clustering together parts of an image that belong to the same object class. More specifically, the goal of semantic segmentation is to label each pixel of an image with the class of the object it represents. Semantic segmentation is useful for a variety of tasks, including self-driving vehicles and medical image diagnostics. The following video shows some examples of semantic segmentation that I performed through OpenCV by importing the weights of a pre-trained Mask R-CNN.
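
A condensed sketch of this approach is shown below, assuming the frozen TensorFlow graph and the corresponding .pbtxt descriptor of a COCO-trained Mask R-CNN (the file and image names are placeholders):

```python
# Minimal sketch: segmentation masks from a pre-trained Mask R-CNN through OpenCV's dnn module.
# The frozen graph, the .pbtxt descriptor and the input image are placeholder file names.
import cv2
import numpy as np

net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "mask_rcnn_inception_v2_coco.pbtxt")

image = cv2.imread("scene.jpg")
h, w = image.shape[:2]
net.setInput(cv2.dnn.blobFromImage(image, swapRB=True))
boxes, masks = net.forward(["detection_out_final", "detection_masks"])

for i in range(boxes.shape[2]):
    score = boxes[0, 0, i, 2]
    if score < 0.5:
        continue
    # Rescale the box to image coordinates and threshold the low-resolution mask
    x1, y1, x2, y2 = (boxes[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
    x1, y1, x2, y2 = max(x1, 0), max(y1, 0), min(x2, w), min(y2, h)
    class_id = int(boxes[0, 0, i, 1])
    mask = cv2.resize(masks[i, class_id], (x2 - x1, y2 - y1)) > 0.5
    # Tint the masked pixels green
    region = image[y1:y2, x1:x2]
    region[mask] = (0.5 * region[mask] + np.array([0, 127, 0])).astype(np.uint8)

cv2.imwrite("segmented.jpg", image)
```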

Instance Segmentation

While semantic segmentation treats multiple objects that belong to the same class as a single entity, instance segmentation differentiates multiple instances of the same class. In other words, instance segmentation assigns different colors or labels to the instances of the same class (e.g. “Person 1”, “Person 2”, etc). For this reason, instance segmentation produces a richer output as compared to both semantic segmentation and object detection (see previous sections).

The following video shows an example of instance vs semantic segmentation, which I obtained through pre-trained network models in PixelLib.
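
A minimal sketch of the comparison is shown below; the pre-trained model files (Mask R-CNN for instances, DeepLabV3+ for semantic masks) must be downloaded separately, and the file names are placeholders:

```python
# Minimal sketch: instance vs semantic segmentation with pre-trained models in PixelLib.
# The .h5 model files must be downloaded separately; image and file names are placeholders.
from pixellib.instance import instance_segmentation
from pixellib.semantic import semantic_segmentation

# Instance segmentation with a Mask R-CNN trained on COCO ("Person 1", "Person 2", ...)
instance_seg = instance_segmentation()
instance_seg.load_model("mask_rcnn_coco.h5")
instance_seg.segmentImage("street.jpg", show_bboxes=True,
                          output_image_name="instance_output.jpg")

# Semantic segmentation with a DeepLabV3+ model trained on Pascal VOC (one color per class)
semantic_seg = semantic_segmentation()
semantic_seg.load_pascalvoc_model("deeplabv3_xception_tf_dim_ordering_tf_kernels.h5")
semantic_seg.segmentAsPascalvoc("street.jpg", output_image_name="semantic_output.jpg")
```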

Object Tracking

Object tracking is a computer vision application where the algorithm tracks the movement of objects in space or across different camera angles. In other words, the algorithm takes an initial set of object detections, creates a unique ID for each of them, and then tracks the objects as they move around frames in a video, while maintaining the ID assignment.

The following video shows some examples of object tracking (+ semantic segmentation) that I performed through a pre-trained YOLOv8 model and the BoT-SORT tracker.
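
With the ultralytics package, this pipeline reduces to a few lines; the sketch below assumes a segmentation variant of YOLOv8 and a placeholder input video:

```python
# Minimal sketch: object tracking (+ segmentation) with a pre-trained YOLOv8 model
# and the BoT-SORT tracker from the ultralytics package. "video.mp4" is a placeholder.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")  # segmentation variant of YOLOv8
results = model.track(source="video.mp4", tracker="botsort.yaml", show=True, save=True)

# Each frame's result carries the boxes, the masks and the persistent track IDs
for frame_result in results:
    print(frame_result.boxes.id)
```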

Hand Tracking

Hand tracking is a computer vision technique that detects a hand in an input image and follows the hand’s movement and orientation. The user’s hand movements can be used to control robots, drones, videogames, and much more.

In the following video, I combined Mediapipe’s hand tracking module with Pygame’s mixer module to create a hand-controlled music player. This allowed me to play music without using the mouse or the keyboard. Note that I implemented my own algorithm to recognize gestures in real time, based on the relative positions of the hand joints.
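
The sketch below illustrates the general idea with a deliberately simple rule (index fingertip above the wrist toggles playback); my actual gesture-recognition algorithm is more elaborate, and the song file name is a placeholder:

```python
# Minimal sketch of a hand-controlled player: Mediapipe detects the hand landmarks and a
# deliberately simple rule (index fingertip above the wrist) toggles playback with Pygame.
# My actual gesture-recognition algorithm is more elaborate; "song.mp3" is a placeholder.
import cv2
import mediapipe as mp
import pygame

pygame.mixer.init()
pygame.mixer.music.load("song.mp3")
pygame.mixer.music.play()

hands = mp.solutions.hands.Hands(max_num_hands=1)
cap = cv2.VideoCapture(0)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        lm = results.multi_hand_landmarks[0].landmark
        # Landmark 8 = index fingertip, landmark 0 = wrist (y grows downwards in the image)
        if lm[8].y < lm[0].y:
            pygame.mixer.music.unpause()
        else:
            pygame.mixer.music.pause()
    cv2.imshow("Hand-controlled player", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```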

Below I show another video, which provides a simple proof of concept for creating hand-controlled videogames. In this video, I modified the original Pymaze script published at this link and integrated it with Mediapipe’s hand tracking module. I show that it is possible to beat the maze by drawing the path in the air with a finger.

Human Pose Estimation

Human pose estimation (HPE) is a way of identifying and classifying the joints in the human body. HPE is useful for a variety of tasks, such as sign language recognition and full-body gesture control. It can also be used to display digital content and information on top of the physical world in augmented reality.

The following video shows some examples of HPE that I performed on video through Google’s MediaPipe. I then imported the pose into Unity to animate a basic 3D model of the human body, which suggests possible applications of this technology in videogame development.
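
A minimal sketch of the MediaPipe side of this pipeline is shown below; it prints the normalized (x, y, z) coordinates of the 33 body landmarks for each frame, which is roughly the kind of data that can then be exported to Unity (the input file name is a placeholder):

```python
# Minimal sketch: human pose estimation on a video with MediaPipe.
# It prints the normalized (x, y, z) coordinates of the 33 body landmarks per frame,
# roughly the kind of data that can then be exported to Unity. "input.mp4" is a placeholder.
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture("input.mp4")
with mp_pose.Pose() as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # Draw the skeleton and print the joint coordinates
            mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                      mp_pose.POSE_CONNECTIONS)
            print([(lm.x, lm.y, lm.z) for lm in results.pose_landmarks.landmark])
        cv2.imshow("Pose estimation", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()
```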

Other Biologically-Inspired Networks

I also did some work on replicating in silico the mechanisms of orientation tuning of the primary visual cortex. Specifically, I considered a recurrent neural network of neurons with a sigmoidal activation function, which mimics some of the most important neurophysiological aspects of the orientation hypercolumns in V1 (see the picture below).

A model of orientation hypercolumns in V1

Then, by performing a mathematical study of the local bifurcations of the model, I investigated how changes in the external stimuli to the hypercolumn affect the dynamics of its firing activity. I extended previous work, which focused on the mean-field (i.e. infinite-size) approximation of the hypercolumn equations, to the mathematically more complex case of a model with a finite number of microcolumns and neurons. The analysis revealed an explosion of complexity in the dynamical behavior of the model, which extends previous results obtained for the case of a single-microcolumn network.
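
As a rough mathematical sketch (the connectivity, parameters and microcolumn structure of the model I actually studied may differ), a recurrent network of this kind can be written as a system of rate equations with a sigmoidal activation function:

```latex
% Generic rate equations for a recurrent network of N sigmoidal units:
% J_ij is the synaptic weight from unit j to unit i, h_i(t) the external stimulus,
% tau the time constant, and S a sigmoid with gain lambda and threshold theta.
\tau \frac{\mathrm{d}u_i}{\mathrm{d}t} = -u_i(t) + \sum_{j=1}^{N} J_{ij}\, S\big(u_j(t)\big) + h_i(t),
\qquad S(u) = \frac{1}{1 + e^{-\lambda (u - \theta)}}, \qquad i = 1, \dots, N
```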

AI-Driven Snake Game

This algorithm is an example of deep Q-learning applied to the Snake game. When the training starts, the algorithm doesn’t know the rules of the game, and it performs random moves to explore the environment. At the same time, the algorithm uses a deep neural network to approximate the Q-function of the game, which mathematically describes the knowledge acquired during the learning process. The snake behaves less and less randomly over time, and it starts exploiting the Q-function to make moves. After sufficient training, the algorithm learns to play the game.
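
In the standard formulation of deep Q-learning, for each observed transition (s, a, r, s') the network prediction Q(s, a; θ) is pushed toward the temporal-difference target below, where γ is the discount factor:

```latex
% Temporal-difference target and loss used to train the deep Q-network with weights theta.
y = r + \gamma \max_{a'} Q(s', a'; \theta),
\qquad
L(\theta) = \big( y - Q(s, a; \theta) \big)^{2}
```
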
Typical framing of a deep Q-learning scenario
The following video shows that a deep neural network with a hidden layer of 8,192 neurons can learn to play the Snake game while avoiding obstacles in the environment.

Typically, a deep Q-learning algorithm stores a subset of training data in a memory buffer that it reviews offline. This process, called “experience replay”, allows the algorithm to learn anew from successes or failures that occurred in the past, thereby avoiding catastrophic forgetting of previous knowledge. Experience replay mimics the memory consolidation process that has been postulated to occur in the hippocampus, where the neural activity related to a recent experience has been observed to spontaneously reoccur (for more information, see this link). In Alzheimer’s disease, damage to the brain generally starts in the hippocampal-entorhinal cortex system (see here), suggesting that experience replay is potentially impaired during the course of the disease.
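
As a minimal sketch of how such a memory buffer can be implemented (the capacity and batch size are arbitrary illustrative values):

```python
# Minimal sketch of an experience-replay buffer for deep Q-learning.
# Transitions are stored while playing and replayed later in random mini-batches.
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # Oldest experiences are discarded first once the buffer is full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition observed during play
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Draw a random mini-batch, breaking the temporal correlations of the game
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```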

The hippocampus in the human brain (3D model from the BodyParts3D/Anatomography database).

Physics-Informed Neural Networks (PINNs)

PINNs are deep neural networks that are trained to solve supervised learning tasks while respecting any given law of physics (see this reference for more details). The law of physics is typically described by general nonlinear partial differential equations (PDEs), such as the Schrödinger equation in quantum physics. It follows that a specific application of PINNs is to approximate the unknown solution of a given PDE by enforcing the equation and its initial and boundary conditions in the loss function of a deep-learning algorithm.
An example of PINN used to solve a second-order PDE
Here I focus on a specific nonlinear reaction–diffusion PDE known as Fisher’s equation:

∂u/∂t = D ∂²u/∂x² + α u (1 − u),

which is extensively used in research on Alzheimer’s disease to simulate the spreading of misfolded proteins between brain regions (with u(t,x) being the misfolded protein concentration, D the diffusion coefficient, and α the reaction rate).

The following video shows the integration of Fisher’s equation by a PINN with 8 hidden layers of 20 neurons each. Note that in this example the solution u(t,x) spreads over time along the x-axis (a travelling-wave solution). In other words, once misfolded protein is present anywhere in the brain (u>0), the concentration is always repelled from the unstable, benign state u=0 and attracted to the misfolded, stable state u=1.
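
A condensed TensorFlow sketch of the physics-informed loss for Fisher's equation is shown below. The coefficients, collocation points and training settings are illustrative, and the initial- and boundary-condition terms of the full loss are omitted for brevity:

```python
# Minimal sketch of the physics-informed loss for Fisher's equation u_t = D u_xx + a u (1 - u).
# Coefficients, collocation points and training settings are illustrative; the initial- and
# boundary-condition terms of the full loss are omitted for brevity.
import tensorflow as tf

D, a = 0.1, 1.0  # illustrative diffusion and reaction coefficients

# Fully connected network approximating u(t, x): 8 hidden layers of 20 neurons each
u_net = tf.keras.Sequential(
    [tf.keras.layers.InputLayer(input_shape=(2,))]
    + [tf.keras.layers.Dense(20, activation="tanh") for _ in range(8)]
    + [tf.keras.layers.Dense(1)]
)


def pde_residual(t, x):
    # Automatic differentiation gives u_t, u_x and u_xx at the collocation points
    with tf.GradientTape() as tape2:
        tape2.watch(x)
        with tf.GradientTape(persistent=True) as tape1:
            tape1.watch([t, x])
            u = u_net(tf.concat([t, x], axis=1))
        u_t = tape1.gradient(u, t)
        u_x = tape1.gradient(u, x)
    u_xx = tape2.gradient(u_x, x)
    return u_t - D * u_xx - a * u * (1.0 - u)


# Random collocation points in the (t, x) domain
t_col = tf.random.uniform((1000, 1), 0.0, 1.0)
x_col = tf.random.uniform((1000, 1), -5.0, 5.0)
optimizer = tf.keras.optimizers.Adam(1e-3)

# Minimize the mean squared PDE residual (the "physics" part of the PINN loss)
for step in range(5000):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(pde_residual(t_col, x_col)))
    grads = tape.gradient(loss, u_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, u_net.trainable_variables))
```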
