MDEV-39092 Copy Aria data and logs as part of backup#4971
MDEV-39092 Copy Aria data and logs as part of backup#4971mariadb-andrzejjarzabek wants to merge 14 commits into
Conversation
|
|
4ef94be to
c143ff2
Compare
efbc62b to
c77278b
Compare
e89625e to
06fd556
Compare
4d6a19c to
d86a6a1
Compare
c6f5119 to
9ebb23e
Compare
8565956 to
04e3bc2
Compare
14ca552 to
c9429eb
Compare
d35cd47 to
824afeb
Compare
2dfc033 to
01a069c
Compare
| backup_target target; | ||
| #ifndef _WIN32 | ||
| const int datadir_fd; | ||
| #endif |
There was a problem hiding this comment.
Why is there a declaration of datadir_fd in the first place? It turns out that it is never being assigned, only passed to openat(2). This seems to rely on zero-initialization and some undefined behaviour. For example, in Linux, AT_FDCWD is defined as -100.
There was a problem hiding this comment.
I noticed later that this data member is being initialized from a call to open(…, O_DIRECTORY) in the constructor. I think that this is bad practice; we should allow the singleton object to be initialized statically.
It is unclear when if ever the directory handle would be closed. I suspect that we would hold the handle open until the server is shut down.
There was a problem hiding this comment.
datadir_fd is initialized in the Aria_backup constructor.
There was a problem hiding this comment.
Is it a reasonable trade-off to have a descriptor permanently opened for backups for the purpose of just one plugin? Should each plugin that performs the backup keep an open descriptor to the data directory? When adding another bit of functionality (not backup) to any part of the server or a plugin, with its own class/module/translation unit, which needs to open files in the data directory, should it also maintain a file descriptor to the data directory? There may be a case for defining an API for the server to maintain a single descriptor and make it available for other code, but given that there isn't one, I would argue that opening a descriptor for the duration of the backup and closing it on backup end. A backup isn't a fine-grained operation which we expect to be performed multiple times per second (or even minute), but we expect that many hours may pass between successive backups. And every backup will typically open hundreds or thousands of files, so saving the time and I/O of one open/close is not significant and it's not clear that maintaining an additional open descriptor the whole time the server is running is is balanced out by that.
In any case I propose putting that discussion off for a possible future improvement, where we could discuss defining an API to make a data directory descriptor available to the whole server.
0de6368 to
8b1c81b
Compare
| case BACKUP_PHASE_NO_DML_NON_TRANS: | ||
| /* FIXME: Would be better to selectively purge only the tables we need. */ | ||
| tc_purge(); | ||
| tdc_purge(true); | ||
| return 0; |
There was a problem hiding this comment.
The comment is missing a reference to a ticket where this performance problem will be fixed. As far as I understand, we only need such cleanup for ENGINE=Aria and ENGINE=MyISAM. We don’t want unnecessary disruption of ENGINE=InnoDB. The BACKUP_PHASE_NO_COMMIT that will be executed a couple of steps after this one will be disruptive enough.
This introduces a basic driver Sql_cmd_backup, storage engine interfaces, and basic copying of InnoDB, Aria, and MyISAM files. For streaming, we aim to generate streams that are compatible with GNU tar --format=oldgnu, which is also supported by the built-in BSD tar on FreeBSD, macOS as well as Microsoft Windows. That is, to extract files from a backup stream, you can use the standard tar utility of the operating system, instead of anything nonstandard like xbstream or mbstream. TODO: Support partial backup and restore. This should be done by implementing some configuration parameters as well as a predicate that checks if a file name pattern should be included. TODO: Back up ENGINE=RocksDB. TODO: Refactor the crude first implementation of Aria_backup that is based on the work of Andrzej Jarzabek, to copy as many files as concurrently as possible while holding the minimum amount of locks. --- backup_target: A structured data type to represent a target directory. On Microsoft Windows, we must use directory paths because there is no variant of CopyFileEx() that would work on file handles. backup_sink: Wraps a per-thread output stream as well as storage engine specific context. handlerton::backup_start(), handlerton::backup_end(): Invoked at the start or end of a backup phase, in the thread that executes a BACKUP SERVER statement. handlerton::backup_step(): A backup step that can be invoked from multiple threads concurrently, between the execution of the corresponding handlerton::backup_start() and handlerton::backup_end() of the same phase. copy_entire_file(): A file copying service for POSIX systems. copy_file(): A sparse file-copying service for all systems. backup_stream_append_async(): A variant of backup_stream_append() where the source file region is guaranteed to be immutable after the call returns. We must not use Linux sendfile(2) for copying data files that may be modified in place, because it could introduce a race condition between a page write that runs concurrently with a child process that is reading the data from the pipe. InnoDB_backup::context: Backup context, attached to backup_sink so that context can continue to exist between the time a BACKUP SERVER releases all locks and another BACKUP SERVER starts executing, with innodb_backup pointing to the new backup, while the old backup is still being finished. fil_space_t::write_or_backup: Keep track of in-flight page writes and pending backup operation. We must not allow them concurrently, because that could lead into torn pages in the backup. fil_space_t::backup_end: The first page number that is not being backed up (by default 0, to indicate that no backup is in progress). fil_space_t::BACKUP_BATCH_SIZE: The number of preceding pages that will be covered by fil_space_t::backup_end. This is the unit of "page range locking" during InnoDB backup. log_sys.backup: Whether BACKUP SERVER is in progress. The purpose of this is to make BACKUP SERVER prevent the concurrent execution of SET GLOBAL innodb_log_archive=OFF or SET GLOBAL innodb_log_file_size when innodb_log_archive=OFF. log_sys.archived_checkpoint: Keep track of the earliest available checkpoint, corresponding to log_sys.archived_lsn. This reflects SET GLOBAL innodb_log_recovery_start (which is settable now), for incremental backup. buf_flush_list_space(): Check for concurrent backup before writing each page. This is inefficient, but this function may be invoked from multiple threads concurrently, and it cannot be changed easily, especially for fil_crypt_thread().
8b1c81b to
d23446e
Compare
| static my_bool copy_misc_files(const backup_target *target, | ||
| const backup_sink *sink) | ||
| { | ||
| Dir_scan datadir(".", MYF(MY_WANT_STAT)); | ||
| if (datadir.is_error()) | ||
| return true; | ||
| std::unordered_set<std::string> ensured_dirs; | ||
| int error= datadir.for_each([target, sink, &ensured_dirs](const fileinfo &fi) |
There was a problem hiding this comment.
As I pointed out in this comment in #4817, executing potentially failing operations in a constructor is not a good design. This must be refactored in the style of fffd4b6. Inline iteration (range for using st_::span or std::span) should be more efficient than lambda functions, which cannot always be inlined by compilers. In this function we even have two nested lambda function iterations, on two different Dir_scan.
| struct backup_sink; | ||
|
|
||
| /** A payload chunk in a sparse file that is being streamed */ | ||
| struct backup_chunk |
There was a problem hiding this comment.
What is the point of adding a forward declaration right before an actual declaration?
| /* RAII wrapper for my_dir() */ | ||
| class Dir_scan | ||
| { | ||
| public: | ||
| explicit Dir_scan(const char* path, myf flags) noexcept | ||
| { | ||
| dir_info= my_dir(path, flags); | ||
| if (!dir_info) | ||
| { | ||
| my_error(ER_CANT_READ_DIR, MYF(0), path, my_errno); | ||
| } | ||
| } | ||
| ~Dir_scan() noexcept | ||
| { | ||
| my_dirend(dir_info); | ||
| } | ||
| bool is_error() const noexcept | ||
| { | ||
| return !dir_info; | ||
| } |
There was a problem hiding this comment.
Please, do not execute potentially failing operations in a constructor.
Why is the destructor potentially invoking my_dirend(nullptr)?
What purpose does explicit have on a constructor that takes more than 1 parameter?
| MY_DIR *dir_info {nullptr}; | ||
| }; | ||
|
|
||
| std::string build_path(const char *base_path, const char *filename) noexcept; |
There was a problem hiding this comment.
It would be better to refactor my_dir() so that it can return a directory handle, on which openat may be invoked. We generally try to avoid operations on std::string because there already are enough issues with heap fragmentation.
| if (phase == BACKUP_PHASE_NO_DDL) | ||
| fail= copy_misc_files(&target_phase->target, &target_phase->sink); | ||
| if (fail) | ||
| break; |
There was a problem hiding this comment.
This is unnecessarily testing fail when the first if does not hold.
| const backup_sink *sink) | ||
| { | ||
| Dir_scan datadir(".", MYF(MY_WANT_STAT)); | ||
| if (datadir.is_error()) |
There was a problem hiding this comment.
In the embedded server library libmariadbd, the data directory is not necessarily the current directory.
| static bool is_misc_file(const char* filename) | ||
| { | ||
| size_t filename_len= strlen(filename); | ||
| if (filename_len < ext_len) | ||
| return false; | ||
| const char *file_ext = filename + filename_len - ext_len; | ||
| return match_misc_ext(file_ext) || is_db_opt(filename, filename_len); | ||
| } |
There was a problem hiding this comment.
In 4769a43, Aria_backup::is_db_file() is using strrchr and strcmp for these, and it is omitting a . prefix from the suffix string that are being compared to. This feels unnecessarily complicated. Can you demonstrate that this approach results in smaller or more efficient code on at least one ISA?
| #if 1 // FIXME: invoke these only for Aria, MyISAM, CSV but not others | ||
| case BACKUP_PHASE_NO_DML_NON_TRANS: | ||
| /* FIXME: Would be better to selectively purge only the tables we need. | ||
| To be fixed in MDEV-39987. */ | ||
| tc_purge(); | ||
| tdc_purge(true); | ||
| break; | ||
| #endif |
There was a problem hiding this comment.
This is only partly addressing my previous comment about this. MDEV-39987 does not make it clear that this performance issue mainly affects other storage engines, such as ENGINE=InnoDB or ENGINE=RocksDB, which do not need any table handles to be closed.
Move copying of common files and MyISAM files out of Aria plugim Fix incorrect handling of non-default aria_log_dir_path.
in the same phase.
d23446e to
d4b2048
Compare
An interim solution with some room for optimization: