# Simple PDF Upload & Voter Extraction

## Overview
This service extracts voter details, booth numbers, and street names directly from PDF electoral rolls **without saving any data to the database**. It returns the extracted information as a JSON response for immediate use.

## Key Features
- ✅ **No Database Operations** - Only extracts and returns data
- ✅ **Auto-detect Booth Number** - Extracts from page header
- ✅ **Auto-detect Street Name** - Extracts from section header
- ✅ **Multiple Voter Patterns** - Supports various electoral roll formats
- ✅ **OCR Support** - Handles image-based PDFs
- ✅ **Immediate Response** - No background jobs or queues

## API Endpoint

### POST `/api/pdf-upload/extract`

**Purpose:** Extract voters, booth number, and street name from PDF without database operations

**Request:**
- **Method:** POST
- **Content-Type:** multipart/form-data
- **Body:**
  - `pdf_file`: PDF file (max 20MB)

**Response:**
```json
{
  "success": true,
  "message": "PDF processed successfully",
  "data": {
    "booth_number": "2",
    "street_name": "BAJANAI MADATHU STREET, VENNILA NAGAR",
    "voters": [
      {
        "serial_number": "1",
        "voter_id": "ABC1234567",
        "name": "John Doe",
        "age": 35,
        "gender": "M",
        "year_of_birth": 1990,
        "relation_name": "Father Name",
        "source_line": 25
      }
    ],
    "metadata": {
      "total_voters": 150,
      "text_length": 12450,
      "processing_time": "3.2s",
      "extraction_method": "text-parsing"
    }
  }
}
```

## Header Extraction Patterns

### Booth/Part Number
The service automatically detects booth numbers from headers using these patterns:
- `Part No.:2`
- `Part No: 2`
- `Booth No: 2`
- `Booth Number: 2`
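
The four header forms above can be covered by a single regular expression. The sketch below is illustrative only: the service's actual expressions are not shown in this document, and `BOOTH_PATTERN` and `findBoothNumber` are hypothetical names. It assumes the header appears within the first 50 lines of extracted text.

```javascript
// Hypothetical pattern covering "Part No.:2", "Part No: 2", "Booth No: 2", "Booth Number: 2".
const BOOTH_PATTERN = /\b(?:Part|Booth)\s+(?:No\.?|Number)\s*:\s*(\d+)/i;

function findBoothNumber(lines, maxLines = 50) {
  // Only the top of the document is scanned (assumed 50-line header window).
  for (const line of lines.slice(0, maxLines)) {
    const match = BOOTH_PATTERN.exec(line);
    if (match) return match[1]; // kept as a string, like `booth_number`
  }
  return null; // no header recognised
}
```

Returning `null` rather than an empty string matches the `string|null` type documented for `booth_number`.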

### Street Name
Extracts street names from section headers:
- `Section No and Name 1-BAJANAI MADATHU STREET, VENNILA NAGAR, Puducherry-605013`
- `1-BAJANAI MADATHU STREET, VENNILA NAGAR`

The service keeps only the street name portion and removes the section number prefix and location suffixes such as the city name and PIN code.
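
As a rough illustration of this clean-up, the sketch below strips the optional `Section No and Name` label, the numeric prefix, and a trailing `, <Place>-<6-digit PIN>` suffix. The function name and both regular expressions are assumptions made for this example, not the service's code.

```javascript
// Hypothetical helper; the real parsing rules may differ.
function extractStreetName(headerLine) {
  // Optional "Section No and Name" label, then "<number>-", then the street text.
  const match = /^(?:Section\s+No\s+and\s+Name\s*:?\s*)?\d+\s*-\s*(.+)$/i.exec(headerLine.trim());
  if (!match) return null;
  return match[1]
    // Assumption: the location suffix looks like ", Puducherry-605013".
    .replace(/,\s*[A-Za-z ]+-\s*\d{6}\s*$/, '')
    .trim();
}
```

With the two sample headers above, both calls yield `BAJANAI MADATHU STREET, VENNILA NAGAR`.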

## Voter Extraction Patterns

### Pattern 1: Full Record in Single Line
```
1 ABC1234567 JOHN DOE 35 M
```
Extracts: Serial, EPIC, Name, Age, Gender
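
A single-line record of this shape can be matched with one anchored expression. This is a minimal sketch under the assumption that gender appears as a single letter (`M`, `F`, or `O`) at the end of the line; `parseSingleLineRecord` is a hypothetical name.

```javascript
// Serial, EPIC (3 letters + 7 digits), name, age, gender.
const SINGLE_LINE_RECORD = /^(\d+)\s+([A-Z]{3}\d{7})\s+(.+?)\s+(\d{1,3})\s+([MFO])$/;

function parseSingleLineRecord(line) {
  const m = SINGLE_LINE_RECORD.exec(line.trim());
  if (!m) return null; // the line does not follow Pattern 1
  return {
    serial_number: m[1],
    voter_id: m[2],
    name: m[3],
    age: Number(m[4]),
    gender: m[5],
  };
}
```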

### Pattern 2: EPIC and Name with Context
```
ABC1234567 JOHN DOE
Father Name: FATHER NAME
35 Male
```
Extracts: EPIC, Name, Relation, Age, Gender (from current or next line)

### Pattern 3: Multi-line Records
If age, gender, or relation is not found on the current line, the service looks ahead to the following lines.
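
The look-ahead idea behind Patterns 2 and 3 can be sketched as follows. Everything here is an assumption made for illustration: the three-line look-ahead window, the 1-based `source_line`, and the exact wording of the relation and age/gender lines are not specified in this document.

```javascript
const EPIC_LINE = /^([A-Z]{3}\d{7})\s+(.+)$/;
const RELATION_LINE = /^(?:Father|Husband|Wife)\s+Name\s*:\s*(.+)$/i;
const AGE_GENDER_LINE = /^(\d{1,3})\s+(Male|Female|Other|M|F|O)$/i;

function parseMultiLineRecords(lines, lookAhead = 3) {
  const voters = [];
  for (let i = 0; i < lines.length; i++) {
    const start = EPIC_LINE.exec(lines[i].trim());
    if (!start) continue; // not the first line of a record
    const voter = {
      voter_id: start[1], name: start[2],
      relation_name: null, age: null, gender: null,
      source_line: i + 1, // assumption: 1-based line number
    };
    // Look ahead a few lines for the fields missing from the first line.
    for (let j = i + 1; j <= i + lookAhead && j < lines.length; j++) {
      const next = lines[j].trim();
      if (EPIC_LINE.test(next)) break; // the next record starts here
      const relation = RELATION_LINE.exec(next);
      if (relation) voter.relation_name = relation[1];
      const ageGender = AGE_GENDER_LINE.exec(next);
      if (ageGender) {
        voter.age = Number(ageGender[1]);
        voter.gender = ageGender[2][0].toUpperCase(); // "Male" -> "M"
      }
    }
    voters.push(voter);
  }
  return voters;
}
```

Stopping at the next EPIC line keeps one voter's age or relation from leaking into the record before it.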

## Usage Examples

### Using Postman

1. **Create New Request**
   - Method: `POST`
   - URL: `http://your-domain/api/pdf-upload/extract`

2. **Set Body**
   - Select `form-data`
   - Add key: `pdf_file`
   - Type: `File`
   - Select your PDF file

3. **Send Request**

### Using cURL

```bash
curl -X POST http://your-domain/api/pdf-upload/extract \
  -F "pdf_file=@/path/to/electoral_roll.pdf"
```

### Using PHP

```php
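$curl = curl_init();

// Attach the PDF as a multipart file upload.
$file = new CURLFile('/path/to/electoral_roll.pdf', 'application/pdf', 'voters.pdf');

curl_setopt_array($curl, [
    CURLOPT_URL => 'http://your-domain/api/pdf-upload/extract',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => ['pdf_file' => $file],
]);

$response = curl_exec($curl);
if ($response === false) {
    // Transport-level failure (DNS, connection, timeout).
    exit('Request failed: ' . curl_error($curl) . "\n");
}
curl_close($curl);

$result = json_decode($response, true);
if (empty($result['success'])) {
    exit('Extraction failed: ' . ($result['message'] ?? 'Unknown error') . "\n");
}

// booth_number and street_name are null when no header was detected.
echo "Booth Number: " . ($result['data']['booth_number'] ?? 'not detected') . "\n";
echo "Street Name: " . ($result['data']['street_name'] ?? 'not detected') . "\n";
echo "Total Voters: " . count($result['data']['voters']) . "\n";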
```

### Using JavaScript (Fetch)

```javascript
const formData = new FormData();
formData.append('pdf_file', fileInput.files[0]);

fetch('http://your-domain/api/pdf-upload/extract', {
  method: 'POST',
  body: formData
})
.then(response => response.json())
.then(result => {
  // Error responses carry success: false and a message.
  if (!result.success) {
    throw new Error(result.message);
  }
  console.log('Booth Number:', result.data.booth_number);
  console.log('Street Name:', result.data.street_name);
  console.log('Voters:', result.data.voters);
})
.catch(error => console.error('Extraction failed:', error));
```

## How It Works

### 1. PDF Upload & Validation
- Validates PDF file (max 20MB)
- Accepts various MIME types (application/pdf, application/octet-stream)
- Checks file integrity
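
The same kind of check can be mirrored on the client before uploading. The sketch below is hypothetical and does not describe the controller's validation rules; it assumes the 20MB limit means 20 × 1024 × 1024 bytes and relies on the fact that PDF files begin with the ASCII signature `%PDF-`.

```javascript
// Illustrative pre-check only; the server remains the source of truth.
const MAX_PDF_BYTES = 20 * 1024 * 1024; // assumption: 20MB = 20 MiB

function looksLikePdf(bytes) {
  // `bytes` is a Uint8Array (or Node Buffer) holding the file contents.
  if (bytes.length === 0 || bytes.length > MAX_PDF_BYTES) return false;
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return signature.every((byte, i) => bytes[i] === byte);
}
```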

### 2. Text Extraction
- **First Attempt:** Direct text extraction from PDF
- **Fallback:** OCR if PDF is image-based (uses Tesseract)
- Processes all pages
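
The fallback order can be sketched as below. The extractor functions are passed in as parameters because this example does not call any real PDF or OCR library. The `MIN_TEXT_CHARS` threshold is an assumption, since the document does not say how the service decides that a PDF is image-based.

```javascript
// Assumption: very little direct text means the PDF is image-based.
const MIN_TEXT_CHARS = 50;

function extractWithFallback(pdf, extractText, runOcr) {
  // First attempt: direct text extraction.
  const direct = (extractText(pdf) || '').trim();
  if (direct.length >= MIN_TEXT_CHARS) {
    return { text: direct, extraction_method: 'text-parsing' };
  }
  // Fallback: OCR for image-based PDFs.
  const ocrText = (runOcr(pdf) || '').trim();
  if (ocrText.length === 0) {
    throw new Error('No text could be extracted');
  }
  return { text: ocrText, extraction_method: 'ocr' };
}
```

The returned `extraction_method` values correspond to the ones reported in the response metadata.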

### 3. Header Parsing
- Scans first 50 lines for booth/part number
- Identifies street name from section headers
- Handles various header formats

### 4. Voter Extraction
- Line-by-line parsing
- Pattern matching for EPIC numbers (3 letters + 7 digits)
- Multi-line record support
- Age and gender detection

### 5. Response Generation
- Returns structured JSON
- Includes metadata (processing time, method used)
- No database writes

## Response Fields

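### Top Level
- `success` (bool) - Whether the request succeeded
- `message` (string) - Human-readable status message
- `data` (object) - Extraction result, described below

### `data` Object
- `booth_number` (string|null) - Extracted booth/part number
- `street_name` (string|null) - Extracted street name
- `voters` (array) - Array of voter objects
- `metadata` (object) - Processing information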

### Voter Object
- `voter_id` (string) - EPIC number (e.g., "ABC1234567")
- `name` (string) - Voter's name
- `age` (int|null) - Age
- `gender` (string|null) - "M", "F", or "O"
- `year_of_birth` (int|null) - Calculated from age
- `serial_number` (string|null) - Serial number if found
- `relation_name` (string|null) - Father/Husband/Wife name
- `source_line` (int) - Line number in the extracted text where the record was found
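
Because `year_of_birth` is derived from `age`, it is only an estimate. One plausible derivation, assuming the service subtracts the age from the current year (the document does not state which reference year is used), looks like this:

```javascript
// Hypothetical helper: estimated birth year from a stated age.
function estimateYearOfBirth(age, referenceYear = new Date().getFullYear()) {
  if (!Number.isInteger(age) || age <= 0) return null; // mirrors the int|null type
  return referenceYear - age;
}
```

For example, an age of 35 with a reference year of 2025 gives 1990, which matches the sample response.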

### Metadata Object
- `total_voters` (int) - Number of voters extracted
- `text_length` (int) - Total characters extracted
- `processing_time` (string) - Time taken (e.g., "3.2s")
- `extraction_method` (string) - "text-parsing" or "ocr"

## Error Handling

### Validation Errors
```json
{
  "success": false,
  "message": "Validation failed",
  "errors": {
    "pdf_file": ["The file must be a valid PDF document."]
  }
}
```

### Extraction Errors
```json
{
  "success": false,
  "message": "Failed to process PDF: No text could be extracted",
  "code": 500
}
```
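
A client can tell the two error shapes apart by checking for the `errors` object. The helper below is a sketch with a hypothetical name, `classifyResponse`, and it assumes error bodies always follow one of the two shapes shown above.

```javascript
function classifyResponse(body) {
  if (body && body.success === true) {
    return { ok: true, kind: 'success', messages: [] };
  }
  if (body && body.errors) {
    // Validation failure: collect the per-field messages into one list.
    return { ok: false, kind: 'validation', messages: Object.values(body.errors).flat() };
  }
  // Anything else is treated as an extraction failure.
  const message = body && body.message ? body.message : 'Unknown error';
  return { ok: false, kind: 'extraction', messages: [message] };
}
```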

## Common Issues & Solutions

### Issue: "No text could be extracted from the PDF"
**Causes:**
- PDF is corrupted
- PDF is password-protected
- PDF contains only images without text layer

**Solutions:**
- Ensure PDF is valid and not corrupted
- Remove password protection
- No action is needed for image-based PDFs; the service falls back to OCR automatically
- If OCR fails, check Tesseract installation: `tesseract --version`

### Issue: Booth number or street name not detected
**Causes:**
- Non-standard header format
- Header text in unexpected location

**Solutions:**
- Check logs to see what was extracted
- Headers should contain keywords: "Part", "Booth", "Section"
- Contact support to add new pattern support

### Issue: Voters not detected or incomplete data
**Causes:**
- Non-standard voter record format
- Poor OCR quality (for image-based PDFs)

**Solutions:**
- Ensure PDF has clear, readable text
- For image-based PDFs, use high resolution (300 DPI+)
- Check the sample voter records written to the logs

## Performance Considerations

### Processing Time
- **Text-based PDFs:** 1-3 seconds
- **Image-based PDFs (OCR):** 5-15 seconds, depending on page count
- Time scales with the number of pages and voters

### File Size Limits
- **Max Size:** 20MB
- **Recommended:** Under 10MB for faster processing
- Large files may time out on shared hosting

### Optimization Tips
1. Use text-based PDFs when possible (faster than OCR)
2. Reduce PDF size by optimizing images
3. Upload only the pages you need
4. Split large files into smaller page ranges and upload them separately

## Differences from Other Import Methods

| Feature | pdf-upload/extract | pdf-import/upload | image-import/upload |
|---------|-------------------|-------------------|---------------------|
| **Input** | PDF file | PDF file | Image file |
| **Saves to DB** | ❌ No | ✅ Yes | ❌ No |
| **Returns Data** | ✅ Immediate JSON | Job ID (async) | ✅ Immediate JSON |
| **Booth Detection** | ✅ From header | ✅ Complex parsing | ✅ From header |
| **Street Detection** | ✅ From header | ✅ Complex parsing | ❌ Limited |
| **OCR Support** | ✅ Auto-fallback | ✅ Full support | ✅ Primary method |
| **Processing** | Synchronous | Background Job | Synchronous |
| **Use Case** | Quick extraction | Bulk import | Screenshot analysis |

## Technical Details

### File Structure
```
app/
├── Http/Controllers/
│   └── SimplePdfUploadController.php   # Handles upload & validation
├── Services/
│   └── SimplePdfExtractorService.php   # Extraction & parsing logic
routes/
└── api.php                              # Route definition
```

### Dependencies
- `smalot/pdfparser` - PDF text extraction
- `imagick` PHP extension - Image processing
- `tesseract` - OCR (optional, for image-based PDFs)

### Logging
All operations are logged to Laravel logs:
```bash
tail -f storage/logs/laravel.log
```

Log entries include:
- Upload details
- Extraction method used
- Text length extracted
- Booth/street detection results
- Voter parsing results
- Processing time

## Testing

### Quick Test
```bash
# Test with sample PDF
curl -s -X POST http://localhost:8000/api/pdf-upload/extract \
  -F "pdf_file=@sample_electoral_roll.pdf" \
  | jq '.'
```

### Verify Response
```bash
# Check booth number (-s hides curl's progress meter)
curl -s -X POST http://localhost:8000/api/pdf-upload/extract \
  -F "pdf_file=@sample.pdf" \
  | jq '.data.booth_number'

# Count voters
curl -s -X POST http://localhost:8000/api/pdf-upload/extract \
  -F "pdf_file=@sample.pdf" \
  | jq '.data.voters | length'
```

## Best Practices

### For Developers
1. Always check the `success` field in the response
2. Handle null values for `booth_number` and `street_name`
3. Parse the `voters` array safely (it may be empty)
4. Use metadata for debugging
5. Implement timeout handling (15-30 seconds recommended)
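
The five points above can be combined in a small client wrapper. This is a sketch, not a supplied SDK: `normalizeExtraction` and `extractWithTimeout` are hypothetical names, the 30-second default comes from the upper end of the recommended range, and `fetchImpl` exists only so that the function can be exercised without a network.

```javascript
function normalizeExtraction(body) {
  // Point 1: check `success` before touching `data`.
  if (!body || body.success !== true) {
    throw new Error((body && body.message) || 'Extraction failed');
  }
  const data = body.data || {};
  return {
    boothNumber: data.booth_number ?? null, // point 2: may be null
    streetName: data.street_name ?? null,
    voters: Array.isArray(data.voters) ? data.voters : [], // point 3: may be empty
    metadata: data.metadata ?? {}, // point 4: keep for debugging
  };
}

async function extractWithTimeout(url, formData, timeoutMs = 30000, fetchImpl = fetch) {
  // Point 5: abort the request if it runs longer than timeoutMs.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetchImpl(url, {
      method: 'POST',
      body: formData,
      signal: controller.signal,
    });
    return normalizeExtraction(await response.json());
  } finally {
    clearTimeout(timer); // always clear the timer, even on failure
  }
}
```

A caller would pass the same `FormData` object used in the Fetch example and handle the rejected promise for timeouts and failed extractions.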

### For API Users
1. Use text-based PDFs when possible
2. Keep files under 10MB
3. Validate extracted data before use
4. Implement retry logic for transient failures
5. Cache results if you process the same file multiple times
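
Retry logic (point 4) might look like the sketch below. It assumes that network errors and HTTP 429, 502, 503, and 504 responses are the transient cases. A plain 500 is deliberately not retried, because the extraction error shown earlier uses code 500 for PDFs that will fail again on the next attempt. All names and delay values are illustrative.

```javascript
// Assumption: these statuses indicate a temporary condition worth retrying.
const TRANSIENT_STATUSES = new Set([429, 502, 503, 504]);

function backoffDelayMs(attempt, baseMs = 1000, capMs = 8000) {
  // Exponential backoff: 1s, 2s, 4s, ... capped at capMs.
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(sendRequest, maxAttempts = 3, sleep = defaultSleep) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const response = await sendRequest();
      if (!TRANSIENT_STATUSES.has(response.status)) return response;
      lastError = new Error(`Transient HTTP status ${response.status}`);
    } catch (err) {
      lastError = err; // network failure or timeout
    }
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  throw lastError;
}
```

Here `sendRequest` is any function that performs the upload and returns a response with a `status` property, such as a wrapped `fetch` call.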

### For Production
1. Set appropriate timeout values
2. Monitor processing times
3. Implement rate limiting
4. Add request queuing for concurrent uploads
5. Consider async processing for large files

## Future Enhancements

Potential improvements:
- [ ] Support for multiple booth numbers in single PDF
- [ ] Confidence scores for OCR results
- [ ] Table structure detection
- [ ] House number extraction
- [ ] Support for regional languages
- [ ] Batch processing multiple PDFs
- [ ] WebSocket for progress updates on large files

## Version History

### v1.0.0 (Current)
- Initial release
- Booth number extraction from header
- Street name extraction from section header
- Multiple voter pattern support
- OCR fallback for image-based PDFs
- No database operations

## Support

### Troubleshooting Steps
1. Check Laravel logs: `storage/logs/laravel.log`
2. Verify Tesseract installation: `tesseract --version`
3. Test with simple text-based PDF first
4. Check PDF file integrity
5. Review extraction patterns in logs

### Contact
When reporting an issue or requesting a feature, check the logs first and include:
- PDF file sample (if possible)
- Error message from response
- Laravel log entries
- Expected vs actual output
