Malay Media Audio Dataset

Project Overview

Objective

The Malay Media Audio Dataset project aims to build an audio dataset for training machine learning models in voice recognition, natural language processing, and media analysis. The dataset focuses specifically on the Malay language and provides a source of spoken linguistic data.

Scope

The project covers the collection and annotation of Malay-language audio files from diverse sources, including media clips, interviews, and other spoken-word recordings. Each audio file is annotated with detailed metadata: speaker identity, speech context, and technical attributes.

Sources

Movies and TV Shows: Scenes featuring characters speaking in Malay.
Interviews: Conversations and interviews in which Malay is spoken.
Online Videos: Social media clips, YouTube videos, and other online content featuring spoken Malay.

Data Collection Metrics

Total Audio Recordings: 18,000 recordings
Media Clips: 7,000
Interviews: 6,000
Other Spoken-Word Recordings: 5,000

Annotation Process

Stages

Speaker Identification: Annotate each audio recording with the identity of the speaker(s) and their role in the media.
Contextual Tagging: Tag each recording with context information like topic, setting, and emotional tone.
Technical Annotation: Include technical data such as audio quality, duration, and background noise levels.
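
The three annotation stages above can be pictured as one metadata record per recording. The following is a minimal sketch only: the class names, field names, and allowed values (`AnnotationRecord`, `source_type`, `SOURCE_TYPES`, and so on) are illustrative assumptions, not the dataset's published schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical source categories, mirroring the three collection buckets.
SOURCE_TYPES = {"media_clip", "interview", "other_spoken_word"}


@dataclass
class Speaker:
    speaker_id: str  # assumed identifier for the speaker
    role: str        # the speaker's role in the media, e.g. "host"


@dataclass
class AnnotationRecord:
    recording_id: str
    source_type: str
    # Speaker Identification
    speakers: List[Speaker]
    # Contextual Tagging
    topic: str
    setting: str
    emotional_tone: str
    # Technical Annotation
    audio_quality: str
    duration_seconds: float
    background_noise: str

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record passes."""
        problems = []
        if self.source_type not in SOURCE_TYPES:
            problems.append(f"unknown source_type: {self.source_type}")
        if not self.speakers:
            problems.append("at least one speaker is required")
        if self.duration_seconds <= 0:
            problems.append("duration_seconds must be positive")
        return problems


# Example record with made-up values, for illustration only.
example = AnnotationRecord(
    recording_id="rec-000001",
    source_type="interview",
    speakers=[Speaker("spk-01", "interviewer"), Speaker("spk-02", "guest")],
    topic="education",
    setting="studio",
    emotional_tone="neutral",
    audio_quality="high",
    duration_seconds=182.5,
    background_noise="low",
)
print(example.validate())
```

Keeping all three stages in a single record means a recording can be checked for completeness in one pass before it is counted in the annotation metrics.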

Annotation Metrics

Audio Recordings with Speaker and Contextual Labels: 18,000
Technical Annotations: 18,000

Quality Assurance

Stages

Annotation Validation: Validate annotations to ensure their accuracy.
Audio Checks: Regularly check recordings for audio quality and clarity.
Compliance: Adhere to data privacy regulations and ethical guidelines.

QA Metrics

Audio Quality Checks: 3,000 recordings
Annotation Accuracy Review: 2,000 recordings
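
To put the QA figures in proportion, the short sketch below checks that the collection buckets sum to the stated total and expresses each QA count as a share of all recordings. It assumes the QA counts are subsets of the 18,000 recordings, which the figures above suggest but do not state outright; the variable and function names are illustrative.

```python
TOTAL_RECORDINGS = 18_000

# Figures taken from the Data Collection Metrics and QA Metrics lists.
collection = {
    "media_clips": 7_000,
    "interviews": 6_000,
    "other_spoken_word": 5_000,
}
qa_counts = {
    "audio_quality_checks": 3_000,
    "annotation_accuracy_review": 2_000,
}


def coverage(reviewed: int, total: int) -> float:
    """Fraction of the dataset covered by a QA pass."""
    return reviewed / total


# The three collection buckets should account for every recording.
assert sum(collection.values()) == TOTAL_RECORDINGS

for name, count in qa_counts.items():
    share = coverage(count, TOTAL_RECORDINGS)
    print(f"{name}: {count} of {TOTAL_RECORDINGS} ({share:.1%})")
```

Under that assumption, audio quality checks cover about 16.7% of the recordings and the annotation accuracy review about 11.1%.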

Conclusion

The Malay Media Audio Dataset supports the development of machine learning models that require Malay-language audio input. It combines 18,000 recordings from media clips, interviews, and other spoken-word sources with speaker, contextual, and technical annotations, making it suitable for researchers and developers working in voice recognition, linguistic analysis, and media studies. Quality checks on audio and annotations, together with adherence to data privacy regulations and ethical guidelines, are intended to keep the dataset reliable across these applications.