# ArXiv Social Good AI Paper Fetcher

An automated system for discovering and cataloging research papers related to AI bias, fairness, and social good from arXiv.org. This tool uses GPT-4o to intelligently filter papers for social impact and automatically updates a target repository with newly discovered relevant research.

## 🎯 Features

- **Intelligent Paper Detection**: Uses GPT-4o to analyze paper titles and abstracts for social good and fairness relevance
- **Automated Daily Updates**: Runs daily via GitHub Actions to fetch the latest papers
- **Historical Paper Collection**: Can fetch and process papers from the past 2 years
- **GitHub Integration**: Automatically updates the target repository's README with new findings
- **Comprehensive Filtering**: Focuses on the AI/ML categories most likely to contain social impact research
- **Social Good Focus**: Identifies bias and fairness research across healthcare, education, criminal justice, and more
- **Reverse Chronological Order**: Always keeps the newest papers at the top of the README for easy access
- **Unlimited Historical Mode**: Configurable limits for processing tens of thousands of papers
- **Scalable Architecture**: Supports research-grade data collection with custom memory and API limits

## 🔧 Setup & Configuration

### Prerequisites

- Python 3.11+
- OpenAI API key with GPT-4o access
- GitHub Personal Access Token with repository write permissions

### Environment Variables

Configure the following environment variables:

```bash
OPENAI_API_KEY=your_openai_api_key_here
TARGET_REPO_TOKEN=your_github_token_here
TARGET_REPO_NAME=username/repository-name
```

For GitHub Actions, these should be configured as repository secrets:

- `OPENAI_API_KEY`
- `TARGET_REPO_TOKEN`

### Installation

1. Clone this repository:

   ```bash
   git clone https://github.com/YurenHao0426/PaperFetcher.git
   cd PaperFetcher
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
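For reference, here is a minimal sketch of how `scripts/fetch_papers.py` might read its configuration from the environment. The variable names and defaults mirror the "Environment Variables" table in Configuration Options below; the actual implementation may read them differently.

```python
import os

# Names and defaults mirror the "Environment Variables" table below;
# the real fetch_papers.py may organize this differently.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]        # required
TARGET_REPO_TOKEN = os.environ["TARGET_REPO_TOKEN"]  # required
TARGET_REPO_NAME = os.getenv("TARGET_REPO_NAME", "YurenHao0426/awesome-llm-bias-papers")

FETCH_MODE = os.getenv("FETCH_MODE", "daily")        # "daily" or "historical"
FETCH_DAYS = int(os.getenv("FETCH_DAYS", "1"))
MAX_HISTORICAL_PAPERS = int(os.getenv("MAX_HISTORICAL_PAPERS", "50000"))
MAX_PAPERS_PER_CATEGORY = int(os.getenv("MAX_PAPERS_PER_CATEGORY", "10000"))
USE_PARALLEL = os.getenv("USE_PARALLEL", "true").lower() == "true"

# Assumption: the concurrency default depends on the mode, per the table.
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "16" if FETCH_MODE == "daily" else "50"))
```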
## 🚀 Usage

### Daily Paper Fetching

Fetch papers from the last 24 hours:

```bash
python scripts/fetch_papers.py
```

Fetch papers from the last N days:

```bash
FETCH_DAYS=7 python scripts/fetch_papers.py
```

### Historical Paper Fetching

Fetch papers from the past 2 years (default limit: 50,000 papers):

```bash
FETCH_MODE=historical python scripts/fetch_papers.py
```

**Unlimited Historical Mode** - configure custom limits for large-scale data collection:

```bash
# Small scale (for testing)
MAX_HISTORICAL_PAPERS=1000 MAX_PAPERS_PER_CATEGORY=200 FETCH_MODE=historical python scripts/fetch_papers.py

# Medium scale (recommended for daily use)
MAX_HISTORICAL_PAPERS=5000 MAX_PAPERS_PER_CATEGORY=1000 FETCH_MODE=historical python scripts/fetch_papers.py

# Large scale (for research purposes)
MAX_HISTORICAL_PAPERS=50000 MAX_PAPERS_PER_CATEGORY=10000 FETCH_MODE=historical python scripts/fetch_papers.py

# Ultra large scale (use with caution)
MAX_HISTORICAL_PAPERS=100000 MAX_PAPERS_PER_CATEGORY=20000 MAX_CONCURRENT=100 FETCH_MODE=historical python scripts/fetch_papers.py
```

### Testing

Test the daily fetching functionality:

```bash
python scripts/test_daily_fetch.py
```

Test the historical fetching functionality:

```bash
python scripts/test_historical_fetch.py
```

Test the Social Good prompt effectiveness:

```bash
python scripts/test_social_good_prompt.py
```

Test the reverse chronological ordering:

```bash
python scripts/test_reverse_chronological.py
```

Test the unlimited historical mode:

```bash
python scripts/test_unlimited_historical.py
```

### Debugging

If the system completes too quickly or you suspect no papers are being fetched, use the debug script:

```bash
python scripts/debug_fetch.py
```

This will show detailed information about:

- arXiv API connectivity
- OpenAI API connectivity
- Number of papers fetched at each step
- Sample papers and filtering results

### Parallel Processing

The system supports parallel processing of OpenAI requests for faster filtering (see the sketch after this section):

```bash
# Test parallel vs. sequential performance
python scripts/test_parallel_processing.py
```

**Performance optimization options:**

```bash
# Enable/disable parallel processing
USE_PARALLEL=true python scripts/fetch_papers.py

# Control concurrent requests (default: 16 for daily, 50 for historical)
MAX_CONCURRENT=20 python scripts/fetch_papers.py

# Disable parallel processing for debugging
USE_PARALLEL=false python scripts/fetch_papers.py
```

**Expected speedup:** 3-10x faster processing, depending on the number of papers and network conditions.
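For illustration, here is a minimal sketch of how `MAX_CONCURRENT`-bounded parallel filtering can be structured with `asyncio`. `classify_paper` is a hypothetical stand-in for the real GPT-4o relevance check; the actual `fetch_papers.py` may be organized differently.

```python
import asyncio

async def classify_paper(paper: dict) -> bool:
    """Hypothetical stand-in for the GPT-4o relevance check."""
    await asyncio.sleep(0)  # stands in for the network round-trip
    return "fairness" in paper.get("abstract", "").lower()

async def filter_papers(papers: list[dict], max_concurrent: int = 16) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrent)  # cap on in-flight requests

    async def bounded(paper: dict) -> bool:
        async with semaphore:
            return await classify_paper(paper)

    # Launch all classifications at once; the semaphore throttles them.
    verdicts = await asyncio.gather(*(bounded(p) for p in papers))
    return [p for p, keep in zip(papers, verdicts) if keep]

if __name__ == "__main__":
    sample = [{"title": "Demo", "abstract": "Algorithmic fairness in hiring."}]
    print(asyncio.run(filter_papers(sample, max_concurrent=20)))
```

Sequential mode (`USE_PARALLEL=false`) would simply await the check for each paper in turn, which is why the speedup grows with the number of papers.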
## 🚀 Unlimited Historical Mode

The system supports processing tens of thousands of papers for comprehensive historical analysis.

### Configuration Options

| Scale | Papers | Per Category | Concurrent | Use Case |
|-------|--------|--------------|------------|----------|
| **Small** | 1,000 | 200 | 10 | Testing & Development |
| **Medium** | 5,000 | 1,000 | 25 | Daily Research |
| **Large** | 50,000 | 10,000 | 50 | Comprehensive Analysis |
| **Ultra** | 100,000+ | 20,000+ | 100+ | Research-Grade Mining |

### Performance Estimates

- **Small Scale**: ~15 minutes, ~$1 API cost
- **Medium Scale**: ~1 hour, ~$5 API cost
- **Large Scale**: ~4 hours, ~$50 API cost
- **Ultra Scale**: ~15+ hours, ~$100+ API cost

### Memory Requirements

- **4GB RAM**: up to 20,000 papers
- **8GB RAM**: up to 50,000 papers
- **16GB+ RAM**: 100,000+ papers

### Usage Examples

```bash
# Test configuration (recommended first run)
MAX_HISTORICAL_PAPERS=1000 \
MAX_PAPERS_PER_CATEGORY=200 \
MAX_CONCURRENT=10 \
FETCH_MODE=historical python scripts/fetch_papers.py

# Research configuration (comprehensive historical data)
MAX_HISTORICAL_PAPERS=50000 \
MAX_PAPERS_PER_CATEGORY=10000 \
MAX_CONCURRENT=50 \
FETCH_MODE=historical python scripts/fetch_papers.py
```

### Important Considerations

⚠️ **Before running large-scale operations:**

- Test with small configurations first
- Monitor API usage and costs
- Ensure sufficient memory and a stable network
- Consider running during off-peak hours
- Large runs may take several hours to complete

## 🤖 GitHub Actions

The project includes automated GitHub Actions workflows:

### Daily Schedule

- Runs daily at 12:00 UTC
- Fetches papers from the last 24 hours
- Updates the target repository automatically

### Manual Trigger

- Can be triggered manually from the GitHub Actions tab
- Supports both `daily` and `historical` modes
- Configurable number of days for daily mode

## 📋 Paper Categories

The system searches these arXiv categories for relevant papers:

- `cs.AI` - Artificial Intelligence
- `cs.CL` - Computation and Language
- `cs.CV` - Computer Vision and Pattern Recognition
- `cs.LG` - Machine Learning
- `cs.NE` - Neural and Evolutionary Computing
- `cs.RO` - Robotics
- `cs.IR` - Information Retrieval
- `cs.HC` - Human-Computer Interaction
- `stat.ML` - Machine Learning (Statistics)

## 🎯 Relevance Criteria

Papers are considered relevant if they discuss:

- **Bias and fairness** in AI/ML systems with societal impact
- **Algorithmic fairness** in healthcare, education, criminal justice, hiring, or finance
- **Demographic bias** affecting marginalized or underrepresented groups
- **Data bias** and its social consequences
- **Ethical AI** and responsible AI deployment in society
- **AI safety** and alignment with human values and social welfare
- **Bias evaluation, auditing, or mitigation** in real-world applications
- **Representation and inclusion** in AI systems and datasets
- **Social implications** of AI bias (e.g., perpetuating inequality)
- **Fairness** in recommendation systems, search engines, or content moderation

## 📁 Project Structure

```
PaperFetcher/
├── scripts/
│   ├── fetch_papers.py                # Main fetching script (with parallel support)
│   ├── test_daily_fetch.py            # Daily fetching test
│   ├── test_historical_fetch.py       # Historical fetching test
│   ├── test_parallel_processing.py    # Parallel processing performance test
│   ├── test_improved_fetch.py         # Improved fetching logic test
│   ├── test_social_good_prompt.py     # Social Good prompt testing
│   ├── test_reverse_chronological.py  # Reverse chronological order testing
│   ├── test_unlimited_historical.py   # Unlimited historical mode testing
│   └── debug_fetch.py                 # Debug and troubleshooting script
├── .github/
│   └── workflows/
│       └── daily_papers.yml           # GitHub Actions workflow
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
```

## 🔍 How It Works

1. **Paper Retrieval**: Queries the arXiv API for papers in the relevant CS categories
2. **Date Filtering**: Filters papers by submission/update dates
3. **AI Analysis**: Uses GPT-4o to analyze each paper's title and abstract for social good relevance (see the sketch below)
4. **Social Impact Assessment**: Evaluates papers for bias, fairness, and societal implications
5. **Repository Update**: Adds relevant papers to the target repository's README in reverse chronological order
6. **Version Control**: Commits changes with descriptive commit messages
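A minimal sketch of step 3, assuming the official `openai` Python package. The prompt shown here is hypothetical; the real prompt and response parsing live in `scripts/fetch_papers.py` and are exercised by `scripts/test_social_good_prompt.py`.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt, condensed from the "Relevance Criteria" section above.
SYSTEM_PROMPT = (
    "You screen arXiv papers for an AI bias/fairness/social good reading list. "
    "Given a title and abstract, answer YES if the paper discusses bias, "
    "fairness, or the societal impact of AI; otherwise answer NO."
)

def is_relevant(title: str, abstract: str) -> bool:
    """Ask GPT-4o for a YES/NO relevance verdict on one paper."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
        temperature=0,  # deterministic verdicts
        max_tokens=3,   # only YES or NO is needed
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```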
## ⚙️ Configuration Options

### Environment Variables

| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `OPENAI_API_KEY` | OpenAI API key for GPT-4o | - | Yes |
| `TARGET_REPO_TOKEN` | GitHub token for repository access | - | Yes |
| `TARGET_REPO_NAME` | Target repository (owner/repo format) | `YurenHao0426/awesome-llm-bias-papers` | No |
| `FETCH_MODE` | Mode: `daily` or `historical` | `daily` | No |
| `FETCH_DAYS` | Number of days to fetch (daily mode) | `1` | No |
| `MAX_HISTORICAL_PAPERS` | Maximum papers for historical mode | `50000` | No |
| `MAX_PAPERS_PER_CATEGORY` | Maximum papers per arXiv category | `10000` | No |
| `USE_PARALLEL` | Enable parallel processing | `true` | No |
| `MAX_CONCURRENT` | Maximum concurrent requests | `16` (daily), `50` (historical) | No |

## 🐛 Troubleshooting

### Common Issues

1. **API Rate Limits**: The system includes retry logic and respects API limits
2. **Large Historical Fetches**: Historical mode processes up to 50,000 papers by default and may take several hours
3. **Token Permissions**: Ensure the GitHub token has write access to the target repository

### Debugging

Enable debug logging by modifying the logging level in the script:

```python
logging.basicConfig(level=logging.DEBUG)
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Test thoroughly
5. Submit a pull request

## 📄 License

This project is open source and available under the MIT License.

## 📧 Contact

For questions or issues, please open a GitHub issue or contact the maintainer.

---

**Note**: This tool is designed for academic research purposes. Please respect arXiv's usage policies and OpenAI's API terms of service.