Commit 02e833e

Merge pull request #504 from mrexodia/per-resource-last-update
Implement per-resource last_update timestamps
2 parents d19e2ad + b3a8241 commit 02e833e

4 files changed

Lines changed: 348 additions & 25 deletions

CHANGES.rst

Lines changed: 5 additions & 0 deletions
@@ -7,6 +7,11 @@ Unreleased
   optional attachment downloads, and per-repository incremental checkpoints.
 - Add pull request review backups with ``--pull-reviews`` and one-time
   incremental backfill for existing backups.
+- Store incremental ``last_update`` checkpoints per repository resource instead
+  of using one global checkpoint for the whole output directory. Existing
+  backups use the legacy global checkpoint as a migration fallback, and the
+  legacy file is removed once existing issue/pull backups have resource
+  checkpoints (#62).
 - Add ``--token-from-gh`` to read authentication from ``gh auth token``.

README.rst

Lines changed: 8 additions & 4 deletions
@@ -347,15 +347,19 @@ About pull request reviews
 
 Use ``--pull-reviews`` with ``--pulls`` to include GitHub pull request review metadata under each pull request's ``review_data`` key. Reviews are separate from review comments: ``--pull-comments`` backs up inline review comments via ``comment_data`` and regular PR conversation comments via ``comment_regular_data``, while ``--pull-reviews`` backs up review state, submitted time, commit ID, and the top-level review body.
 
-``--pull-reviews`` is included in ``--all``. Incremental backups use a per-repository checkpoint at ``repositories/{repo}/pulls/reviews_last_update``. If ``--pull-reviews`` is enabled on an existing incremental backup, the first run performs a one-time backfill for pull request reviews so older PRs are not skipped by the existing repository checkpoint. Existing ``comment_data``, ``comment_regular_data`` and ``commit_data`` fields are preserved when only review data is being added.
+``--pull-reviews`` is included in ``--all``. Incremental backups use a per-repository checkpoint at ``repositories/{repo}/pulls/reviews_last_update``. If ``--pull-reviews`` is enabled on an existing incremental backup, the first run performs a one-time backfill for pull request reviews so older PRs are not skipped by the existing pull request checkpoint. Existing ``comment_data``, ``comment_regular_data`` and ``commit_data`` fields are preserved when only review data is being added.
 
 
 Incremental Backup
 ------------------
 
-Using (``-i, --incremental``) will only request new data from the API **since the last run (successful or not)**. e.g. only request issues from the API since the last run.
+Using (``-i, --incremental``) will only request new data from the API **since the last successful resource backup**. e.g. only request issues from the API since the last issue backup for that repository.
 
-This means any blocking errors on previous runs can cause a large amount of missing data in backups.
+Incremental checkpoints for issue and pull request API backups are stored per resource in that repository's backup directory (for example ``repositories/{repo}/issues/last_update``, ``repositories/{repo}/pulls/last_update`` or ``starred/{owner}/{repo}/pulls/last_update``). Older versions stored a single global ``last_update`` file in the output directory root. During migration, the legacy global checkpoint is used as a fallback only for resource directories that already contain backup data but do not yet have their own checkpoint. New repositories or newly enabled resources with no existing data get a full backup instead of inheriting an unrelated global checkpoint.
+
+After all existing issue and pull request resource directories have per-resource checkpoints, the legacy global ``last_update`` file is removed automatically.
+
+This means any blocking errors on previous runs can cause missing data in backups for the affected repository resource.
 
 Using (``--incremental-by-files``) will request new data from the API **based on when the file was modified on filesystem**. e.g. if you modify the file yourself you may miss something.
 
@@ -368,7 +372,7 @@ Known blocking errors
 
 Some errors will block the backup run by exiting the script. e.g. receiving a 403 Forbidden error from the Github API.
 
-If the incremental argument is used, this will result in the next backup only requesting API data since the last blocked/failed run. Potentially causing unexpected large amounts of missing data.
+If the incremental argument is used, per-resource checkpoints are only advanced after that resource's backup work completes. A blocking error can still abort the overall run, but repositories and resources that were not processed will keep their previous checkpoints.
 
 It's therefore recommended to only use the incremental argument if the output/result is being actively monitored, or complimented with periodic full non-incremental runs, to avoid unexpected missing data in a regular backup runs.
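The fallback rules described above can be condensed into a short standalone sketch. ``resolve_since`` is a hypothetical name used only for illustration, not part of github-backup's API; it mirrors the three documented cases: a per-resource checkpoint wins, the legacy global checkpoint applies only to directories that already hold backup data, and everything else gets a full backup.

```python
import os
import tempfile

LAST_UPDATE = "last_update"

def resolve_since(resource_dir, legacy_checkpoint):
    """Pick the timestamp to use as the API's since filter for one resource."""
    checkpoint = os.path.join(resource_dir, LAST_UPDATE)
    if os.path.exists(checkpoint):
        # Case 1: the resource already has its own checkpoint.
        with open(checkpoint) as f:
            return f.read().strip()
    has_data = os.path.isdir(resource_dir) and any(
        name != LAST_UPDATE for name in os.listdir(resource_dir)
    )
    # Case 2: existing data without a checkpoint uses the legacy fallback.
    # Case 3: a brand-new resource gets a full backup (no since filter).
    return legacy_checkpoint if has_data else None

# Demonstrate the three cases in a scratch output directory.
with tempfile.TemporaryDirectory() as root:
    legacy = "2024-01-01T00:00:00Z"

    own = os.path.join(root, "repositories/repo-a/issues")
    os.makedirs(own)
    with open(os.path.join(own, LAST_UPDATE), "w") as f:
        f.write("2024-06-01T00:00:00Z")
    print(resolve_since(own, legacy))        # 2024-06-01T00:00:00Z

    migrating = os.path.join(root, "repositories/repo-b/issues")
    os.makedirs(migrating)
    open(os.path.join(migrating, "1.json"), "w").close()
    print(resolve_since(migrating, legacy))  # 2024-01-01T00:00:00Z

    print(resolve_since(os.path.join(root, "repositories/repo-c/issues"), legacy))  # None
```

The key design point is case 3: without it, a newly enabled resource would inherit the unrelated global checkpoint and silently skip everything older than it.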

github_backup/github_backup.py

Lines changed: 146 additions & 21 deletions
@@ -1928,26 +1928,138 @@ def filter_repositories(args, unfiltered_repositories):
     return repositories
 
 
+INCREMENTAL_LAST_UPDATE_FILENAME = "last_update"
+INCREMENTAL_RESOURCE_DIRECTORIES = ("issues", "pulls")
+
+
+def get_repository_checkpoint_time(repository):
+    timestamps = [
+        timestamp
+        for timestamp in (repository.get("updated_at"), repository.get("pushed_at"))
+        if timestamp
+    ]
+    if timestamps:
+        return max(timestamps)
+
+    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.localtime())
+
+
+def resource_backup_exists(resource_cwd):
+    if not os.path.isdir(resource_cwd):
+        return False
+
+    ignored_names = {
+        INCREMENTAL_LAST_UPDATE_FILENAME,
+        PULL_REVIEWS_LAST_UPDATE_FILENAME,
+    }
+    for name in os.listdir(resource_cwd):
+        if name in ignored_names or name.endswith(".temp"):
+            continue
+        return True
+
+    return False
+
+
+def read_legacy_last_update(args, output_directory):
+    if not args.incremental:
+        return None, None
+
+    last_update_path = os.path.join(output_directory, INCREMENTAL_LAST_UPDATE_FILENAME)
+    if os.path.exists(last_update_path):
+        return last_update_path, open(last_update_path).read().strip()
+
+    return last_update_path, None
+
+
+def read_resource_last_update(args, resource_cwd, legacy_last_update=None):
+    if not args.incremental:
+        return None
+
+    last_update_path = os.path.join(resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME)
+    if os.path.exists(last_update_path):
+        return open(last_update_path).read().strip()
+
+    if legacy_last_update and resource_backup_exists(resource_cwd):
+        return legacy_last_update
+
+    return None
+
+
+def write_resource_last_update(args, resource_cwd, repository):
+    if not args.incremental:
+        return
+
+    mkdir_p(resource_cwd)
+    last_update_path = os.path.join(resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME)
+    open(last_update_path, "w").write(get_repository_checkpoint_time(repository))
+
+
+def iter_incremental_resource_dirs(output_directory):
+    repositories_dir = os.path.join(output_directory, "repositories")
+    if os.path.isdir(repositories_dir):
+        for repository_name in os.listdir(repositories_dir):
+            repo_cwd = os.path.join(repositories_dir, repository_name)
+            if not os.path.isdir(repo_cwd):
+                continue
+            for resource_name in INCREMENTAL_RESOURCE_DIRECTORIES:
+                yield os.path.join(repo_cwd, resource_name)
+
+    starred_dir = os.path.join(output_directory, "starred")
+    if os.path.isdir(starred_dir):
+        for owner_name in os.listdir(starred_dir):
+            owner_cwd = os.path.join(starred_dir, owner_name)
+            if not os.path.isdir(owner_cwd):
+                continue
+            for repository_name in os.listdir(owner_cwd):
+                repo_cwd = os.path.join(owner_cwd, repository_name)
+                if not os.path.isdir(repo_cwd):
+                    continue
+                for resource_name in INCREMENTAL_RESOURCE_DIRECTORIES:
+                    yield os.path.join(repo_cwd, resource_name)
+
+
+def has_unmigrated_incremental_resources(output_directory):
+    for resource_cwd in iter_incremental_resource_dirs(output_directory):
+        last_update_path = os.path.join(
+            resource_cwd, INCREMENTAL_LAST_UPDATE_FILENAME
+        )
+        if resource_backup_exists(resource_cwd) and not os.path.exists(
+            last_update_path
+        ):
+            return True
+
+    return False
+
+
+def remove_legacy_last_update_if_migrated(
+    args, output_directory, legacy_last_update_path
+):
+    if not args.incremental or not legacy_last_update_path:
+        return
+    if not os.path.exists(legacy_last_update_path):
+        return
+    if has_unmigrated_incremental_resources(output_directory):
+        logger.info(
+            "Keeping legacy global last_update until all existing issue/pull "
+            "backups have per-resource checkpoints"
+        )
+        return
+
+    os.remove(legacy_last_update_path)
+    logger.info(
+        "Removed legacy global last_update after migrating incremental checkpoints"
+    )
+
+
 def backup_repositories(args, output_directory, repositories):
     logger.info("Backing up repositories")
     repos_template = "https://{0}/repos".format(get_github_api_host(args))
+    legacy_last_update_path, legacy_last_update = read_legacy_last_update(
+        args, output_directory
+    )
+    incremental_resource_work_attempted = False
 
-    if args.incremental:
-        last_update_path = os.path.join(output_directory, "last_update")
-        if os.path.exists(last_update_path):
-            args.since = open(last_update_path).read().strip()
-        else:
-            args.since = None
-    else:
-        args.since = None
-
-    last_update = "0000-00-00T00:00:00Z"
     for repository in repositories:
-        if repository.get("updated_at") and repository["updated_at"] > last_update:
-            last_update = repository["updated_at"]
-        elif repository.get("pushed_at") and repository["pushed_at"] > last_update:
-            last_update = repository["pushed_at"]
-
         if repository.get("is_gist"):
             repo_cwd = os.path.join(output_directory, "gists", repository["id"])
         elif repository.get("is_starred"):
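``get_repository_checkpoint_time`` above picks the later of the repository's ``updated_at`` and ``pushed_at`` values with ``max()`` on the raw strings. That works because the GitHub API serves these fields as fixed-width UTC ISO-8601 strings, which sort chronologically under plain string comparison. A minimal illustration (sample data only):

```python
# GitHub API timestamps are fixed-width "YYYY-MM-DDTHH:MM:SSZ" strings,
# so lexicographic order equals chronological order and no date parsing
# is needed to pick the newest one.
repository = {
    "updated_at": "2024-05-01T12:00:00Z",
    "pushed_at": "2024-07-15T08:30:00Z",
}

timestamps = [
    ts
    for ts in (repository.get("updated_at"), repository.get("pushed_at"))
    if ts  # skip missing/None fields, as the helper above does
]
print(max(timestamps))  # 2024-07-15T08:30:00Z
```

A repository with neither field falls back to the current time formatted the same way, so the next incremental run starts from "now" rather than re-fetching everything.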
@@ -2010,18 +2122,32 @@ def backup_repositories(args, output_directory, repositories):
                 no_prune=args.no_prune,
             )
         if args.include_issues or args.include_everything:
+            incremental_resource_work_attempted = True
+            issue_cwd = os.path.join(repo_cwd, "issues")
+            args.since = read_resource_last_update(
+                args, issue_cwd, legacy_last_update
+            )
             backup_issues(args, repo_cwd, repository, repos_template)
+            write_resource_last_update(args, issue_cwd, repository)
 
         if args.include_pulls or args.include_everything:
+            incremental_resource_work_attempted = True
+            pulls_cwd = os.path.join(repo_cwd, "pulls")
+            args.since = read_resource_last_update(
+                args, pulls_cwd, legacy_last_update
+            )
             backup_pulls(args, repo_cwd, repository, repos_template)
+            write_resource_last_update(args, pulls_cwd, repository)
 
         if args.include_discussions or args.include_everything:
             backup_discussions(args, repo_cwd, repository)
 
         if args.include_milestones or args.include_everything:
             backup_milestones(args, repo_cwd, repository, repos_template)
 
-        if args.include_security_advisories or (args.include_everything and not repository.get("private", False)):
+        if args.include_security_advisories or (
+            args.include_everything and not repository.get("private", False)
+        ):
             backup_security_advisories(args, repo_cwd, repository, repos_template)
 
         if args.include_labels or args.include_everything:
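This hunk applies the same read-checkpoint, back up, write-checkpoint sequence to both issues and pulls. A standalone sketch of that ordering (``run_resource_backup`` and ``do_backup`` are hypothetical names for illustration, not github-backup functions):

```python
import os
import tempfile

def run_resource_backup(resource_dir, do_backup, new_checkpoint):
    """Sketch of the per-resource sequencing: the checkpoint file is only
    (re)written after do_backup() returns, so a failure leaves the old
    checkpoint in place and the next run re-fetches the same window
    instead of silently skipping it."""
    path = os.path.join(resource_dir, "last_update")
    since = open(path).read().strip() if os.path.exists(path) else None
    do_backup(since)  # stands in for backup_issues()/backup_pulls()
    os.makedirs(resource_dir, exist_ok=True)
    with open(path, "w") as f:
        f.write(new_checkpoint)

with tempfile.TemporaryDirectory() as root:
    issues = os.path.join(root, "issues")
    seen = []
    # First run: no checkpoint yet, so the backup sees since=None.
    run_resource_backup(issues, seen.append, "2024-07-15T08:30:00Z")
    # Second run: picks up the checkpoint written by the first run.
    run_resource_backup(issues, seen.append, "2024-08-01T00:00:00Z")
    print(seen)  # [None, '2024-07-15T08:30:00Z']
```

Writing the checkpoint only after the backup call returns is what backs the README claim that per-resource checkpoints are advanced only once that resource's work completes.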
@@ -2045,11 +2171,10 @@ def backup_repositories(args, output_directory, repositories):
             logger.info(f"Skipping remaining resources for {repository['full_name']}")
             continue
 
-    if args.incremental:
-        if last_update == "0000-00-00T00:00:00Z":
-            last_update = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.localtime())
-
-        open(last_update_path, "w").write(last_update)
+    if incremental_resource_work_attempted:
+        remove_legacy_last_update_if_migrated(
+            args, output_directory, legacy_last_update_path
+        )
 
 
 def _repository_owner_name(repository):
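The cleanup in this hunk deletes the legacy global file only once migration is complete. A self-contained sketch of that decision (``migrated`` is a hypothetical helper mirroring the role of ``has_unmigrated_incremental_resources``):

```python
import os
import tempfile

def migrated(resource_dirs):
    """True once every resource directory that holds backup data also
    has its own last_update checkpoint."""
    for d in resource_dirs:
        has_data = os.path.isdir(d) and any(
            n != "last_update" for n in os.listdir(d)
        )
        if has_data and not os.path.exists(os.path.join(d, "last_update")):
            return False
    return True

with tempfile.TemporaryDirectory() as root:
    legacy = os.path.join(root, "last_update")
    with open(legacy, "w") as f:
        f.write("2024-01-01T00:00:00Z")

    issues = os.path.join(root, "repositories/repo/issues")
    os.makedirs(issues)
    open(os.path.join(issues, "1.json"), "w").close()

    # Backup data without a per-resource checkpoint: keep the legacy file,
    # since that directory may still need it as a fallback.
    if migrated([issues]):
        os.remove(legacy)
    print(os.path.exists(legacy))  # True

    # Once the resource has its own checkpoint, the legacy file can go.
    with open(os.path.join(issues, "last_update"), "w") as f:
        f.write("2024-06-01T00:00:00Z")
    if migrated([issues]):
        os.remove(legacy)
    print(os.path.exists(legacy))  # False
```

Deleting too early would strand any still-unmigrated directory with no checkpoint at all, forcing a full re-download on its next incremental run.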
