Database Extraction Algorithm: Building a function to retrieve complete user data through API queries

I’m working on a coding challenge that involves extracting a full username database using limited API calls. The scenario is about an autocomplete feature that returns only the first 5 usernames starting with a given prefix, sorted alphabetically.

The Problem:
I need to implement a retrieve_all function that can dump the entire user database by strategically calling a search API function. The API has these constraints:

  • Returns maximum 5 results per call
  • Results are alphabetically sorted
  • Only shows usernames that begin with the given prefix
  • All usernames contain only lowercase letters a-z

Sample API behavior:

def retrieve_all(search):
    """Takes a search API function and returns complete sorted list of all usernames.
    
    Example API usage:
    search("c") #=> ["carlos", "catherine", "charlie", "chris", "claire"]
    search("ca") #=> ["carlos", "catherine"]
    """
    # IMPLEMENTATION NEEDED
    return [...]

def test_function():
    usernames = ["carlos", "catherine", "charlie", "chris", "claire", "david",
                "diana", "frank", "george", "helen", "isabel", "jack",
                "karen", "lucas", "maria", "nicole"]
    
    search = lambda prefix: [u for u in usernames if u.startswith(prefix)][:5]
    assert retrieve_all(search) == usernames

test_function()

My Current Approach:

def extract_usernames(search, results, prefix, max_depth):
    """Brute force: query every prefix up to max_depth letters deep."""
    if len(prefix) == max_depth:
        return sorted(set(results))

    for letter in letters:
        current_prefix = prefix + letter
        api_response = search(current_prefix)
        results += api_response
        # Recurse with the longer prefix; `results` is mutated in place.
        # (Originally I compared against the full `usernames` list here to
        # stop early, but a real client wouldn't have that list.)
        extract_usernames(search, results, current_prefix, max_depth)

    return sorted(set(results))

letters = "abcdefghijklmnopqrstuvwxyz"

The Issue:
My recursive solution works but makes way too many API calls. I think there should be a smarter way to use the alphabetical ordering to minimize queries. Can someone explain an efficient approach with detailed reasoning rather than just code? I want to understand the logic behind optimizing this type of problem.
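To make "way too many API calls" concrete, one option is to wrap the mock search with a call counter before passing it in (the helper and its names below are hypothetical, not part of the challenge):

```python
def counting_search(usernames):
    """Wrap the mock autocomplete API so each query is counted."""
    calls = {"count": 0}

    def search(prefix):
        calls["count"] += 1
        return [u for u in usernames if u.startswith(prefix)][:5]

    return search, calls
```

Passing the wrapped `search` into an implementation and reading `calls["count"]` afterwards lets you compare the query cost of different strategies directly.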

The key insight here is using a queue-based approach where you only drill down when the API returns exactly 5 results, indicating there might be more usernames with that prefix.

Start by calling search("") to get the first 5 usernames alphabetically. If the API returned exactly 5 items, there could be more usernames hidden behind the cutoff, so you need to search deeper using longer prefixes. I implemented this by maintaining a work queue of prefixes to explore. When search(prefix) returns fewer than 5 results, you know you've found all usernames with that prefix.

The tricky part is handling the boundary cases: you need to generate the next possible prefix correctly to avoid missing usernames. For example, if you get 5 results starting with "ca", you need to explore "caa", "cab", etc. until you find the gap.

This approach typically reduces API calls from hundreds to dozens, because you're only exploring paths that actually contain data rather than exhaustively trying every possible prefix combination.
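A minimal sketch of that queue-based approach, assuming the challenge's constraints (page size of 5, lowercase-only usernames):

```python
from collections import deque
import string

PAGE_SIZE = 5  # the API returns at most 5 results per call

def retrieve_all(search):
    found = set()
    queue = deque([""])  # start from the empty prefix
    while queue:
        prefix = queue.popleft()
        page = search(prefix)
        found.update(page)
        # A full page means the API may have truncated results,
        # so drill one letter deeper.
        if len(page) == PAGE_SIZE:
            # Every match sorting at or before the last entry is already
            # in the page, so skip child prefixes below it ("find the gap").
            start = page[-1][len(prefix):len(prefix) + 1] or "a"
            for c in string.ascii_lowercase:
                if c >= start:
                    queue.append(prefix + c)
    return sorted(found)
```

On the 16-name test fixture above this issues a few dozen queries instead of the hundreds the exhaustive recursion needs, since empty prefixes are never expanded.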