Custom Gesture Recognition Model for Video Playback using MediaPipe
Introduction
What is MediaPipe?
MediaPipe is Google's open-source framework for building multimodal (video, audio, and so on) machine learning pipelines. It is efficient and versatile, which makes it well suited to tasks like gesture recognition.
This tutorial shows how to build a custom model for gesture recognition tasks based on the Google MediaPipe API. It targets video playback specifically, though the approach could be generalized to still-image and live-video recognition.
Note that training can be fine-tuned through hyperparameters.
Customizable Parameters for ModelOptions (affecting model accuracy):
dropout_rate: The fraction of input units to ignore in a dropout layer. Default is 0.05 (5%).
layer_widths: A list specifying the number of units in each hidden layer for the gesture model. Each value creates a new hidden layer with that number of units. These hidden layers include BatchNorm, Dropout, and ReLU. Default is an empty list (no hidden layers).
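As a rough illustration of how layer_widths shapes the model, the sketch below lists the layers that each entry would add. It is not MediaPipe code: the function name describe_hidden_layers is hypothetical, and the ordering of BatchNorm, ReLU, and Dropout within a hidden layer is an assumption made for the example.

```python
def describe_hidden_layers(layer_widths, dropout_rate=0.05):
    """Return an ordered list of layer descriptions for the hidden stack.

    Hypothetical helper for illustration only; the ordering inside each
    hidden layer is an assumption, not taken from the MediaPipe source.
    """
    layers = []
    for width in layer_widths:
        # Each entry in layer_widths contributes one hidden layer.
        layers.append("BatchNorm")
        layers.append("ReLU")
        layers.append(f"Dropout(rate={dropout_rate})")
        layers.append(f"Dense(units={width})")
    return layers


if __name__ == "__main__":
    # The default (empty list) adds no hidden layers at all.
    print(describe_hidden_layers([]))
    # Two entries create two hidden layers with 64 and 32 units.
    for layer in describe_hidden_layers([64, 32], dropout_rate=0.02):
        print(layer)
```

With the default empty list the function returns nothing, which mirrors the "no hidden layers" default described above.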
Customizable Parameters for HParams (affecting model accuracy):
learning_rate: The speed at which the model learns during training. Default is 0.001.
batch_size: The number of samples processed before the model updates. Default is 2.
epochs: The number of times the model will see the entire dataset during training. Default is 10.
steps_per_epoch: (Optional) The number of steps (batches) to run per epoch. If not set, the default is the size of the training dataset divided by the batch size.
shuffle: Whether the dataset is shuffled before training. Default is False.
lr_decay: The rate at which the learning rate decreases over time. Default is 0.99.
gamma: A parameter used for focal loss. Default is 2.
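To make the numeric parameters above more concrete, here is a small sketch of what three of them compute. It rests on assumptions: steps_per_epoch is taken as integer division, lr_decay is applied as one multiplicative step per epoch, and gamma is used in the standard focal loss form for a single sample. The function names are illustrative and not part of MediaPipe.

```python
import math


def default_steps_per_epoch(num_train_samples, batch_size=2):
    """Training set size divided by batch size (integer division assumed)."""
    return num_train_samples // batch_size


def decayed_learning_rate(initial_lr=0.001, lr_decay=0.99, epoch=0):
    """Learning rate after `epoch` epochs, assuming one decay step per epoch."""
    return initial_lr * lr_decay ** epoch


def focal_loss(p_true, gamma=2.0):
    """Standard focal loss for one sample.

    p_true is the predicted probability of the correct class. A larger
    gamma shrinks the loss of samples that are already classified well.
    """
    if not 0.0 < p_true <= 1.0:
        raise ValueError("p_true must be in (0, 1]")
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


if __name__ == "__main__":
    print(default_steps_per_epoch(100))        # batches per epoch for 100 samples
    print(decayed_learning_rate(epoch=10))     # learning rate after 10 epochs
    print(focal_loss(0.9), focal_loss(0.5))    # confident vs. uncertain sample
```

With gamma set to 0 the focal loss reduces to ordinary cross-entropy, which is a convenient sanity check when experimenting with this parameter.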
For example, a model can be trained with a dropout rate of 0.02 and a learning rate of 0.003 by setting dropout_rate in ModelOptions and learning_rate in HParams.
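A minimal sketch of such a training run, assuming the mediapipe-model-maker package is installed, might look like the following. The dataset_dir and export_dir arguments are hypothetical paths. The dataset folder is assumed to contain one subfolder of images per gesture label, and the 80/10/10 split is an arbitrary choice for the example.

```python
# Hyperparameter overrides from the text; everything else keeps its default.
MODEL_OVERRIDES = {"dropout_rate": 0.02}
HPARAM_OVERRIDES = {"learning_rate": 0.003}


def train_custom_recognizer(dataset_dir, export_dir="exported_model"):
    """Train and export a gesture model (requires mediapipe-model-maker)."""
    # Imported lazily because the package is a heavy, optional dependency.
    from mediapipe_model_maker import gesture_recognizer

    # Load labeled images; hand landmarks are extracted during loading.
    data = gesture_recognizer.Dataset.from_folder(
        dirname=dataset_dir,
        hparams=gesture_recognizer.HandDataPreprocessingParams(),
    )
    # 80% training, 10% validation, 10% test (an arbitrary split).
    train_data, rest_data = data.split(0.8)
    validation_data, test_data = rest_data.split(0.5)

    options = gesture_recognizer.GestureRecognizerOptions(
        model_options=gesture_recognizer.ModelOptions(**MODEL_OVERRIDES),
        hparams=gesture_recognizer.HParams(
            export_dir=export_dir, **HPARAM_OVERRIDES
        ),
    )
    model = gesture_recognizer.GestureRecognizer.create(
        train_data=train_data,
        validation_data=validation_data,
        options=options,
    )

    loss, accuracy = model.evaluate(test_data, batch_size=1)
    print(f"Test loss: {loss}, test accuracy: {accuracy}")

    # Writes gesture_recognizer.task into export_dir.
    model.export_model()
    return model
```

Calling train_custom_recognizer("path/to/dataset") would then produce a gesture_recognizer.task file that can be used as the custom model for playback.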
Modify the previous hand tracking script to recognize gestures and control video playback.
    import cv2
    import mediapipe as mp
    import numpy as np
    from mediapipe.tasks.python.components.containers.landmark import NormalizedLandmark
    from mediapipe.framework.formats import landmark_pb2

    # Modify these paths to point to your custom model and video
    video_file_path = 'your_video.mp4'
    gesture_model = 'gesture_recognizer.task'

    # Aliases for the MediaPipe Tasks classes used below
    BaseOptions = mp.tasks.BaseOptions
    VisionRunningMode = mp.tasks.vision.RunningMode
    GestureRecognizer = mp.tasks.vision.GestureRecognizer
    GestureRecognizerOptions = mp.tasks.vision.GestureRecognizerOptions


    def ResizeWithAspectRatio(image, width=None, height=None, inter=cv2.INTER_AREA):
        # Helper assumed to come from the previous script; a minimal version is shown here
        h, w = image.shape[:2]
        if width is None and height is None:
            return image
        if width is None:
            scale = height / float(h)
            dim = (int(w * scale), height)
        else:
            scale = width / float(w)
            dim = (width, int(h * scale))
        return cv2.resize(image, dim, interpolation=inter)


    # Use your custom model to create the options for the gesture recognizer
    with open(gesture_model, "rb") as model_file:
        gesture_options = GestureRecognizerOptions(
            base_options=BaseOptions(model_asset_buffer=model_file.read()),
            running_mode=VisionRunningMode.VIDEO)

    cap = cv2.VideoCapture(video_file_path)
    frame_count = 0
    # The algorithm produces a frame every ~80 ms = 12.5 fps
    writer = cv2.VideoWriter("demo.avi", cv2.VideoWriter_fourcc(*"MJPG"), 12.5, (640, 480))

    # Create instances of the hand tracker and the gesture recognizer
    with mp.solutions.hands.Hands() as hands, \
            GestureRecognizer.create_from_options(gesture_options) as recognizer:
        while cap.isOpened():
            success, image = cap.read()
            # Stop when the video cannot be opened or has ended
            if not success:
                break

            # MediaPipe expects RGB; the BGR frame is kept for drawing and display
            rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            results = hands.process(rgb_image)  # hand landmarks, as in the previous script

            # Per-hand classification data (lists, because there may be two hands)
            current_gestures = []
            current_handedness = []
            current_score = []

            # Recognize gestures. The second argument is the frame timestamp in
            # milliseconds and must keep increasing; the frame counter is used here.
            mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb_image)
            gesture_recognition_result = recognizer.recognize_for_video(mp_image, frame_count)
            frame_count += 1

            # Collect the top gesture, score, and handedness for each detected hand
            if gesture_recognition_result is not None and any(gesture_recognition_result.gestures):
                print("Recognized gestures:")
                for single_hand_gesture_data in gesture_recognition_result.gestures:
                    current_gestures.append(single_hand_gesture_data[0].category_name)
                    current_score.append(round(single_hand_gesture_data[0].score, 2))
                for single_hand_handedness_data in gesture_recognition_result.handedness:
                    current_handedness.append(single_hand_handedness_data[0].category_name)

            # Display the classified gesture data on the frame (first non-left hand only)
            y_pos = image.shape[0] - 70
            for x in range(len(current_gestures)):
                if current_handedness[x] != "Left":
                    txt = current_handedness[x] + ": " + current_gestures[x] + " " + str(current_score[x])
                    # "supination" gets its own text color
                    color = (218, 10, 3) if current_gestures[x] == "supination" else (37, 245, 252)
                    cv2.putText(image, txt, (image.shape[1] - 400, y_pos),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, color, 2, cv2.LINE_AA)
                    print(txt)
                    break

            # Display frame data
            image = ResizeWithAspectRatio(image, height=800)
            image = cv2.putText(image, "Frame {}".format(frame_count), (10, 50),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 1, cv2.LINE_AA)

            # Resize to the writer's dimensions before writing
            resized_frame = cv2.resize(image, (640, 480))
            writer.write(resized_frame)
            cv2.imshow('MediaPipe Hands', image)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break

    cap.release()
    writer.release()
    cv2.destroyAllWindows()
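In video mode, recognize_for_video takes a timestamp in milliseconds as its second argument, and the timestamps must increase from frame to frame. The sketch below shows one way to derive such a timestamp from a frame index and a frame rate. The helper name frame_timestamp_ms is hypothetical, and a constant frame rate is assumed.

```python
def frame_timestamp_ms(frame_index, fps):
    """Convert a zero-based frame index to a timestamp in milliseconds.

    Hypothetical helper; assumes the video has a constant frame rate.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return int(round(frame_index * 1000.0 / fps))


if __name__ == "__main__":
    # At 12.5 fps, consecutive frames are 80 ms apart.
    for index in range(4):
        print(index, frame_timestamp_ms(index, 12.5))
```

In a playback loop, the frame rate could be read from the capture with cap.get(cv2.CAP_PROP_FPS), falling back to a fixed value if the container reports 0.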
Conclusion
In this tutorial, we covered the steps to capture gesture data, train a custom gesture recognition model using MediaPipe, and integrate it for video playback control. This can be expanded with additional gestures and more advanced models for better accuracy.
Hope you enjoy this tutorial and happy coding with MediaPipe!
Remember to replace placeholders such as `'your_video.mp4'` and `'gesture_recognizer.task'` with actual paths relevant to your environment. This tutorial assumes a basic understanding of Python and familiarity with machine learning concepts.