We are senior students in Computer and Artificial Intelligence Engineering at Hacettepe University, working under the guidance of our supervisor, Fuat Akal. Our project aims to contribute to the fields of demographic prediction and NLP models, inspiring further advancements in these areas.
Adem Baran Orhan
2200765034
Umut Kalman
21946256
İhsan Çağatay Eraslan
21827335
Abstract
Social media platforms are significant repositories of public opinion and societal trends in the modern digital era. However, the absence of comprehensive demographic information about social media users poses a challenge in analyzing these trends effectively. Our project, "Profiling Social Media Users," aims to bridge this gap by leveraging natural language processing (NLP) techniques to predict demographic traits such as age, gender, race, and education level.
The core problem addressed by our project is the accurate categorization of social media users based on their demographic attributes. To achieve this, we propose a transformer-based model that integrates various user-related factors, providing a nuanced and holistic analysis. Our methodology includes multiple stages: data collection from platforms like Twitter using web scraping tools, meticulous data labeling, feature engineering, and applying NLP techniques. The model is trained and fine-tuned iteratively to optimize its performance, yielding predictions that offer valuable insights into demographic patterns among social media users.
The impact of our project is twofold: academically, it contributes to the field of demographic prediction by setting a standard for accuracy and fairness; practically, it provides a tool for researchers and organizations to better understand and engage with their audiences. Future directions for this project include expanding the dataset to cover more diverse social media platforms, enhancing the model's robustness, and exploring real-time demographic predictions. We also plan to scale up the training dataset, add more demographic categories, improve the model architecture, and publish academic papers to share our findings with the broader research community. In summary, "Profiling Social Media Users" presents a comprehensive solution to a significant challenge in social media analysis, advancing academic research and offering practical tools for societal benefit.
Link to Promotional Video 🎞️
Download Our Poster
TÜBİTAK 2209-A
Our project has been selected for support by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under the 2209-A University Students Research Projects Support Program. This support from TÜBİTAK not only reinforces the importance of our research but also enables us to further explore and understand the implications of social media behavior on demographic predictions. We look forward to progressing in our research and sharing our findings with the academic and scientific community.
Apify for Data Scraping
We chose Apify Twitter Scraper, a web scraping tool designed specifically for extracting data from X (Twitter). Apify provides a user-friendly interface and robust functionality, enabling efficient and customizable data retrieval from the platform. Its features include support for advanced search queries, data parsing, and scheduling, making it an ideal choice for our project's data collection needs.
BERTweet
We used BERTweet to classify tweets based on user characteristics such as race, gender, sexual orientation, education, and age. BERTweet is a variant of BERT pre-trained on English tweets. It shares BERT's Transformer-based architecture; the main differences are its pre-training procedure and the corpus on which it is trained.
Once we had the data, we preprocessed the tweets according to BERTweet's protocol: translating emojis into text strings, converting user mentions and web/URL links into special placeholder tokens, and applying Byte Pair Encoding for sub-word tokenization. For training, we adapted the pre-trained BERTweet model to our specific task using a supervised learning approach, so that the model learned to associate tweets with their corresponding user profile categories. After training, we used our fine-tuned BERTweet model to classify incoming tweets based on users' race, gender, sexual orientation, education, and age.
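The normalization step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: it assumes the placeholder tokens `@USER` and `HTTPURL` used by BERTweet's published normalization, and it uses a tiny hand-written emoji table (`EMOJI_TO_TEXT`, a hypothetical name) in place of a full emoji library. Sub-word tokenization is left to the model's own tokenizer and is not shown.

```python
import re

# Assumed placeholder tokens, following BERTweet's normalization convention.
MENTION_TOKEN = "@USER"
URL_TOKEN = "HTTPURL"

# A mention is "@" followed by word characters, not preceded by a word
# character (so the "@" inside an e-mail address is left alone).
MENTION_RE = re.compile(r"(?<!\w)@\w+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Tiny illustrative table; a real pipeline would use a complete emoji library.
EMOJI_TO_TEXT = {
    "😂": ":face_with_tears_of_joy:",
    "🔥": ":fire:",
}


def normalize_tweet(text: str) -> str:
    """Replace URLs, mentions, and known emojis with text placeholders."""
    # URLs first, so an "@" inside a link is never treated as a mention.
    text = URL_RE.sub(URL_TOKEN, text)
    text = MENTION_RE.sub(MENTION_TOKEN, text)
    for emoji_char, name in EMOJI_TO_TEXT.items():
        # Pad with spaces so the emoji name becomes its own token.
        text = text.replace(emoji_char, f" {name} ")
    # Collapse the extra whitespace introduced above.
    return " ".join(text.split())
```

For example, `normalize_tweet("@alice check https://t.co/abc 😂")` yields `"@USER check HTTPURL :face_with_tears_of_joy:"`.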
We have deployed our model using the Hugging Face Spaces platform and the Gradio framework. Hugging Face Spaces is a hosted platform that allows developers to create, share, and collaborate on interactive machine learning applications easily. With the Gradio framework, we can provide a user-friendly interface for demonstrating and engaging with our model.
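A Gradio demo of this kind usually wraps a single prediction function in an interface. The sketch below shows that wiring only; it is not the deployed application. The names `classify_tweet`, `build_demo`, and `GENDER_LABELS` are hypothetical, the label set is an assumption, and the prediction function is a stub that returns uniform scores where the fine-tuned model would be called.

```python
from typing import Dict

# Assumed label set for one attribute; the real categories come from
# the project's labeled data.
GENDER_LABELS = ["female", "male"]


def classify_tweet(text: str) -> Dict[str, float]:
    """Stub predictor: returns uniform scores over the labels.

    In a real app this would normalize `text`, run the fine-tuned
    model, and return its per-label probabilities instead.
    """
    score = 1.0 / len(GENDER_LABELS)
    return {label: score for label in GENDER_LABELS}


def build_demo():
    """Wrap the predictor in a simple text-in, label-out Gradio interface."""
    import gradio as gr  # imported here so the predictor works without Gradio

    return gr.Interface(
        fn=classify_tweet,
        inputs=gr.Textbox(label="Tweet"),
        outputs=gr.Label(label="Predicted category"),
        title="Profiling Social Media Users (sketch)",
    )
```

Calling `build_demo().launch()` starts the interface locally; on Hugging Face Spaces the same call would typically sit at the end of the Space's `app.py`.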
All the source code and the dataset reside in our GitHub repository.
Link to GitHub Repository
You may also access our shared drive folder, which includes the models and the usernames of the Twitter users we have collected.
Link to Drive Folder