How to find deleted line numbers in files using GitHub API for specific commits

I’m working on tracking code changes and need to find the specific line numbers that got removed from files in a particular commit using GitHub’s API.

Let me explain what I’m trying to do. When I look at a commit, I can see which files were modified, but I also need to know exactly which line numbers were deleted from each file. For instance, if I have a commit that affects these files:

src/main/java/com/example/service/DataProcessor.java
src/main/java/com/example/service/DataProcessorComponent.java
pom.xml

I want to get the exact line numbers like 42, 58, 67, 71 that were removed from the DataProcessor.java file during that commit.

My goal is to figure out who originally wrote those deleted lines. I already tried using the blame feature on GitHub’s website, but it only shows current lines and doesn’t give me info about the removed ones.

Is there a way to get this information through GitHub’s API or any other method? Any help would be great!

To find deleted line numbers using GitHub’s API, use the commits endpoint. A GET request to /repos/{owner}/{repo}/commits/{sha} returns a files array with a patch field for each file the commit touched, plus a parents array containing the parent commit SHA. The patch is in unified diff format: deleted lines start with a minus sign, but the line numbers are not printed on those lines, so you have to derive them from the hunk headers. A header like @@ -42,4 +42,2 @@ means the hunk covers 4 lines starting at line 42 of the old file and 2 lines starting at line 42 of the new file. Counting forward from 42 through the context lines and - lines (skipping + lines) gives the old-file number of each deletion. Once you have the deleted line numbers, run blame against the parent commit to see who wrote them. Note that the REST API has no blame endpoint; blame is exposed through the GraphQL API (the blame field on a Commit object) or through a local git blame.
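Here is a minimal sketch of the hunk-header counting described above. The function name and the sample patch are made up for illustration; the patch is only shaped like what the patch field contains.

```python
import re

# Matches a unified-diff hunk header such as "@@ -42,4 +42,2 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")


def deleted_line_numbers(patch):
    """Return the old-file line number of every '-' line in a patch string."""
    deleted = []
    old_line = 0
    for line in patch.split("\n"):
        header = HUNK_HEADER.match(line)
        if header:
            # First old-file line covered by this hunk.
            old_line = int(header.group(1))
        elif line.startswith("-"):
            deleted.append(old_line)
            old_line += 1
        elif line.startswith(" "):
            # Context lines exist in the old file too, so they advance the counter.
            old_line += 1
        # '+' lines and "\ No newline at end of file" are not old-file lines.
    return deleted


# Hypothetical patch text (assumption, not taken from a real commit).
SAMPLE_PATCH = "\n".join([
    "@@ -42,4 +42,2 @@ public class DataProcessor {",
    " int a = 1;",
    "-int b = 2;",
    "-int c = 3;",
    " int d = 4;",
])

print(deleted_line_numbers(SAMPLE_PATCH))  # [43, 44]
```

The numbers it returns are line numbers in the parent commit’s version of the file, which is the version you want to blame.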

The Problem:

You’re trying to find deleted line numbers from specific files within a GitHub commit using the GitHub API. You’ve observed that the GitHub blame feature only shows the current lines and not the deleted ones. You need a method to efficiently retrieve these deleted line numbers and ideally link them to their original authors.

Step-by-Step Guide:

This guide uses the GitHub Compare API to retrieve the diff between the parent commit and the commit you care about, then shows how to extract the deleted line numbers programmatically. The patch format is the same unified diff that the single-commit endpoint returns, so the parsing is identical either way. Note that neither endpoint provides author information; that requires additional API calls.

Step 1: Use the GitHub Compare API

As an alternative to the /repos/{owner}/{repo}/commits/{sha} endpoint, you can use the /repos/{owner}/{repo}/compare/{base}...{head} endpoint. It returns the same per-file patch data, but lets you choose the base explicitly, which is useful for merge commits that have more than one parent. Replace {owner}, {repo}, {base}, and {head} with your repository owner, repository name, the SHA of the parent commit, and the SHA of the commit you’re interested in, respectively. For example:

curl -H "Accept: application/vnd.github+json" \
     "https://api.github.com/repos/owner/repo/compare/base_commit_sha...head_commit_sha"

Step 2: Parse the JSON Response

The response is a JSON object. Focus on the files array. Each element represents a changed file and includes fields such as filename, status, deletions, and patch, where patch holds the diff in unified diff format. The patch field can be missing (for example for binary files or very large diffs), so check for it before parsing.
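As a rough sketch of that filtering step, the snippet below picks out the files worth parsing. The helper name and the sample response are hypothetical; only the filename, deletions, and patch fields are assumed from the API.

```python
def files_with_deletions(response):
    """Return (filename, patch) pairs for files that lost at least one line."""
    result = []
    for entry in response.get("files", []):
        patch = entry.get("patch")  # may be absent, e.g. for binary files
        if entry.get("deletions", 0) > 0 and patch is not None:
            result.append((entry["filename"], patch))
    return result


# Hypothetical, trimmed-down response (assumption for illustration only).
SAMPLE_RESPONSE = {
    "files": [
        {
            "filename": "src/main/java/com/example/service/DataProcessor.java",
            "status": "modified",
            "deletions": 2,
            "patch": "@@ -42,4 +42,2 @@\n a\n-b\n-c\n d",
        },
        {
            "filename": "pom.xml",
            "status": "modified",
            "deletions": 0,
            "patch": "@@ -10,2 +10,3 @@\n a\n+b\n c",
        },
        {"filename": "logo.png", "status": "modified", "deletions": 0},
    ]
}

for filename, patch in files_with_deletions(SAMPLE_RESPONSE):
    print(filename)
```

Skipping files with deletions equal to 0 up front saves you from parsing patches that cannot contain removed lines.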

Step 3: Extract Deleted Line Numbers

In the patch field, deleted lines start with -, but the line number is not part of the line. You have to track it yourself: each hunk header @@ -a,b +c,d @@ gives the starting line a in the old file. From there, every context line (leading space) and every deleted line (leading -) advances the old-file counter by one, while added lines (leading +) do not. The following is an example using Python:

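import json
import re

# ... (fetch JSON response from GitHub API as described in Step 1) ...

data = json.loads(response_text)
hunk_header = re.compile(r'^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@')

for file in data['files']:
    deleted_lines = []
    old_line = 0
    # 'patch' can be absent (binary files, very large diffs)
    for line in file.get('patch', '').split('\n'):
        header = hunk_header.match(line)
        if header:
            old_line = int(header.group(1))  # first old-file line of this hunk
        elif line.startswith('-'):
            deleted_lines.append(old_line)   # line number in the parent commit
            old_line += 1
        elif line.startswith(' '):
            old_line += 1                    # context line exists in the old file too
    if deleted_lines:
        print(f"Deleted lines in {file['filename']}: {', '.join(map(str, deleted_lines))}")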

Step 4: Obtain Author Information (Optional)

To get the author information for the deleted lines, you need the parent commit SHA and the line numbers. The REST API has no blame endpoint, so use the GraphQL API instead: look up the parent commit as a Commit object and request its blame(path: ...) field for each file. The result is a list of ranges, each with startingLine, endingLine, and the commit (including its author) that last touched those lines. Match your deleted line numbers against those ranges. This requires additional API calls, and GraphQL requests must be authenticated.
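A sketch of that lookup, assuming you POST the request body to https://api.github.com/graphql with a token. The helper names and sample ranges are invented for illustration, and the HTTP call itself is left out.

```python
# GraphQL query for blame data on one file at one commit.
BLAME_QUERY = """
query($owner: String!, $name: String!, $oid: GitObjectID!, $path: String!) {
  repository(owner: $owner, name: $name) {
    object(oid: $oid) {
      ... on Commit {
        blame(path: $path) {
          ranges {
            startingLine
            endingLine
            commit { oid author { name email } }
          }
        }
      }
    }
  }
}
"""


def build_blame_request(owner, name, parent_sha, path):
    """Build the JSON body for the GraphQL request (hypothetical helper)."""
    variables = {"owner": owner, "name": name, "oid": parent_sha, "path": path}
    return {"query": BLAME_QUERY, "variables": variables}


def authors_for_lines(ranges, line_numbers):
    """Map each deleted line number to the author name of its blame range."""
    authors = {}
    for number in line_numbers:
        for r in ranges:
            if r["startingLine"] <= number <= r["endingLine"]:
                authors[number] = r["commit"]["author"]["name"]
                break
    return authors


# Hypothetical blame ranges, shaped like the `ranges` list in the response.
SAMPLE_RANGES = [
    {"startingLine": 1, "endingLine": 50,
     "commit": {"oid": "aaa111", "author": {"name": "Alice", "email": "a@example.com"}}},
    {"startingLine": 51, "endingLine": 80,
     "commit": {"oid": "bbb222", "author": {"name": "Bob", "email": "b@example.com"}}},
]

print(authors_for_lines(SAMPLE_RANGES, [42, 58, 67, 71]))
```

Blame ranges cover the whole file, so one query per file is enough no matter how many lines were deleted from it.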

Common Pitfalls & What to Check Next:

  • Rate Limits: The GitHub API has rate limits. If you’re working with many commits or large files, check the X-RateLimit-Remaining and X-RateLimit-Reset response headers and pause when the quota runs out.
  • Error Handling: Ensure your script handles potential errors such as network issues or API errors gracefully.
  • Diff Parsing Complexity: The complexity of the diff parsing depends on the type of changes in your commit (e.g., merges, renames). Thorough testing is crucial.
  • API Authentication: Authenticate your API requests with a personal access token. Authenticated requests get a much higher rate limit than anonymous ones.
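For the rate-limit point, one possible approach is to read the X-RateLimit-Remaining and X-RateLimit-Reset response headers and wait until the reset time. The helper below is a hypothetical sketch that takes the headers as a plain dict, so it assumes you pass the header names in exactly this casing.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request, or 0 if quota remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0
    # X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    current = time.time() if now is None else now
    return max(0, reset_at - int(current))


# Hypothetical header values for illustration.
exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700000060"}
print(seconds_until_reset(exhausted, now=1700000000))  # 60
```

In a real script you would call time.sleep() with the returned value before retrying.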

Still running into issues? Share your (sanitized) config files, the exact command you ran, and any other relevant details. The community is here to help!

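github’s web interface is a pain for this, but git commands work great if you clone the repo. run git show --name-only {commit-sha} to see which files changed and git show {commit-sha} -- filename to see the diff with the removed lines, then git blame -L {start},{end} {parent-commit} -- filename to blame just those lines in the parent (the parent is simply {commit-sha}^). way faster than messing with api rate limits and json parsing.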
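if you want to script the local route, here’s a rough sketch that turns a list of deleted line numbers into a single git blame call with one -L range per run of consecutive lines. the helper names are made up, and it assumes a git version that accepts -L more than once.

```python
def group_ranges(lines):
    """Collapse line numbers into (start, end) runs of consecutive lines."""
    ranges = []
    for n in sorted(set(lines)):
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1][1] = n  # extend the current run
        else:
            ranges.append([n, n])  # start a new run
    return [tuple(r) for r in ranges]


def blame_command(parent, path, lines):
    """Build the argv list for git blame on the given old-file lines."""
    cmd = ["git", "blame", "--porcelain"]
    for start, end in group_ranges(lines):
        cmd += ["-L", f"{start},{end}"]
    return cmd + [parent, "--", path]


# Example values are placeholders, not a real commit.
print(blame_command("abc123^", "pom.xml", [42, 43, 58]))
```

pass the resulting list to subprocess.run(..., capture_output=True, text=True) inside the cloned repo and read the author lines from the porcelain output.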

Parsing GitHub API responses manually sucks and you’ll make mistakes, especially if you’re doing this for multiple commits and repos.

I’ve built these tracking systems before and hit the same problems every time - rate limits, messy diff parsing, and tons of custom code to maintain. Automating the whole thing works way better.

Set up a workflow that triggers on commits, grabs the diff data, pulls out deleted line numbers, then runs blame on the parent commit for authorship. Dump it all in a database or wherever you need it.

Best part? No babysitting API calls or writing parsing logic for different diff formats. Configure once and it handles everything - fetching commits, processing diffs, running blame, organizing results.

You can extend it later to track other metrics or notify when someone’s code gets deleted. Way more scalable than manual API calls.

Check out Latenode for this: https://latenode.com
