Conversation
📝 Walkthrough

Adds sample Ruby methods that register CSV-backed tables for DuckDB connections: a table-valued-function based registrar and a CREATE TABLE + insert variant, plus a benchmark comparing the two approaches using in-memory CSV (StringIO).

Changes
Sequence Diagram(s): mermaid

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (3)
sample/issue930.rb (2)
40-47: Connection and database are never closed.
con and db remain open for the lifetime of the process. Either close them explicitly or use DuckDB::Database.open with a block (the block form in lib/duckdb/database.rb ensures db.close is called in ensure).

♻️ Proposed cleanup
```diff
-con = db.connect
-con.query('SET threads=1')
-con.register_as_table('csv_io', csv_io, csv_options: { headers: true })
-result = con.query('SELECT * FROM csv_io()').to_a
-
-p result
-puts result.first.first == '1'
-puts result.first[1] == 'Alice'
+DuckDB::Database.open do |db|
+  con = db.connect
+  con.query('SET threads=1')
+  con.register_as_table('csv_io', csv_io, csv_options: { headers: true })
+  result = con.query('SELECT * FROM csv_io()').to_a
+  con.close
+
+  p result
+  puts result.first.first == '1'
+  puts result.first[1] == 'Alice'
+end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sample/issue930.rb` around lines 40 - 47, The code opens a DuckDB Database and Connection (db and con) but never closes them; either wrap the database creation in DuckDB::Database.open with a block (which ensures db.close) and perform con = db.connect inside that block, or add explicit cleanup by calling con.close and db.close in an ensure/finally to guarantee they are closed even on errors; update the area around con = db.connect and any use of register_as_table / con.query to ensure closure.
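For readers following along, the block-form contract this cleanup relies on can be sketched generically. Resource here is an illustrative stand-in, not the DuckDB API: open yields the object and guarantees close via ensure, even when the block raises.

```ruby
# Illustrative stand-in (Resource is hypothetical, not the DuckDB API)
# for the block-form open/close contract: open yields the object and
# guarantees close via ensure, even when the block raises.
class Resource
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end

  def self.open
    resource = new
    return resource unless block_given?

    begin
      yield resource
    ensure
      resource.close # runs on normal exit and on exceptions alike
    end
  end
end

leaked = Resource.open          # no block: caller must close explicitly
used = nil
Resource.open { |r| used = r }  # block form: closed automatically
```

This is the same shape the review attributes to DuckDB::Database.open's block form.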
10-30: Metrics/MethodLength pipeline failure — 19 lines vs. the 10-line limit. Either add an inline disable comment or extract the table-function block into a named helper lambda to bring the method within the limit.
♻️ Option: extract block into a named lambda
```diff
 def register_as_table(name, io, csv_options: {})
+  raise ArgumentError, 'csv_options must include headers: true' unless csv_options[:headers]
   csv = CSV.new(io, **csv_options)
   headers = csv.first.headers
   csv.rewind
   columns = headers.each_with_object({}) { |header, hash| hash[header] = LogicalType::VARCHAR }
-  tf = DuckDB::TableFunction.create(
-    name:,
-    columns:
-  ) do |_func_info, output|
-    line = csv.readline
-    if line
-      line.each_with_index do |cell, index|
-        output.set_value(index, 0, cell[1])
-      end
-      1
-    else
-      0
-    end
-  end
+  scanner = ->(_, output) {
+    line = csv.readline
+    if line
+      line.each_with_index { |cell, idx| output.set_value(idx, 0, cell[1]) }
+      1
+    else
+      0
+    end
+  }
+  tf = DuckDB::TableFunction.create(name:, columns:, &scanner)
   register_table_function(tf)
 end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sample/issue930.rb` around lines 10 - 30, The method register_as_table is too long; extract the anonymous table-function block passed to DuckDB::TableFunction.create into a named helper (e.g., row_provider_lambda) so register_as_table returns under the 10-line limit; move the logic that reads csv.readline, iterates cells and calls output.set_value into that lambda (ensuring it closes over csv and headers as needed), then pass the named lambda to DuckDB::TableFunction.create and keep the final register_table_function(tf) call in register_as_table.

sample/issue930_benchmark.rb (1)
10-31: register_as_table_with_table_function duplicates register_as_table from issue930.rb. The only behavioral difference is the csv.rewind in the else branch. Consider requiring issue930.rb and reusing its method, or moving the shared logic to a common module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sample/issue930_benchmark.rb` around lines 10 - 31, The method register_as_table_with_table_function is duplicating register_as_table from issue930.rb; refactor by either requiring issue930.rb and calling the existing register_as_table to avoid duplication, or extract the shared CSV-to-table-function logic into a common module/class (e.g., a new CsvTableHelper with a method used by both register_as_table and register_as_table_with_table_function) and make the differing behavior (the csv.rewind in the else branch) configurable via an argument or callback; update register_as_table_with_table_function to delegate to the shared implementation (or call register_as_table) so only the small behavioral difference is implemented without copying the whole method.
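As a sketch of the suggested extraction (the module and method names here are hypothetical), the shared scan logic can live in one helper with the rewind-at-EOF difference exposed as a flag. This runs against plain CSV/StringIO, with no DuckDB dependency:

```ruby
require 'csv'
require 'stringio'

# Hypothetical shared helper for both registrars: builds the per-row
# scanner, with the only behavioral difference (rewind at EOF) as a flag.
module CsvTableHelper
  def self.scanner(csv, rewind_on_eof: false)
    lambda do |emit|
      line = csv.readline
      if line
        # With headers: true, iterating a CSV::Row yields [header, value].
        line.each_with_index { |cell, idx| emit.call(idx, cell[1]) }
        1
      else
        csv.rewind if rewind_on_eof
        0
      end
    end
  end
end

csv = CSV.new(StringIO.new("id,name\n1,Alice\n2,Bob"), headers: true)
scan = CsvTableHelper.scanner(csv, rewind_on_eof: true)

out = []
emit = ->(idx, value) { out << [idx, value] }
scan.call(emit) # first row
scan.call(emit) # second row
```

Both registrars would then wrap this scanner — the table-function variant in a DuckDB::TableFunction block, the benchmark variant with rewind_on_eof: true.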
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sample/issue930_benchmark.rb`:
- Line 1: The file header contains a typo "frozen_sring_literal: true"; update
the magic comment to the correct "frozen_string_literal: true" (fix the
misspelled token "sring" to "string") so the top-level frozen string literal
pragma is applied correctly; locate and correct the comment line in the file
(the existing "frozen_sring_literal" occurrence).
- Around line 59-61: The second benchmark silently depends on the first run
having advanced csv_io to EOF; before calling
con.register_as_table_with_create_table('csv_ct', csv_io, ...) or re-running
con.query('SELECT * FROM csv_ct') you must explicitly reset the IO position
(e.g., csv_io.rewind or csv_io.seek(0)) so the CSV is read from the start; add
that reset between the first run and the block that uses csv_io again to ensure
register_as_table_with_create_table and con.query always see the full CSV input.
- Around line 37-40: The SQL construction in register_as_table_with_create_table
is vulnerable: stop interpolating raw table names (name), column names
(headers), and cell values (cell[1]) directly into execute; instead quote/escape
identifiers (use the DB driver's identifier-quoting API or wrap identifiers in
double quotes and escape embedded quotes) when building the CREATE TABLE and
INSERT column list, and avoid string interpolation for values by using
parameterized queries if the DuckDB client supports bind parameters (or at
minimum escape single quotes in cell values with cell[1]&.gsub("'", "''") before
embedding). Ensure you reference the execute/execute("CREATE OR REPLACE TABLE
..."), headers.map block, and the csv.each/values construction so all three
injection surfaces are fixed consistently.
In `@sample/issue930.rb`:
- Line 1: Fix the malformed magic comment: replace the typo
"frozen_sring_literal: true" with the correct "frozen_string_literal: true" as
the first line of the file so Ruby recognizes it and RuboCop's
Style/FrozenStringLiteralComment passes; locate the existing magic comment (the
token frozen_sring_literal) and correct the spelling to frozen_string_literal.
- Around line 37-43: Replace the literal 30000 with an underscore-separated
numeric literal (30_000) in the csv_data construction and change the
double-quoted SQL string to a single-quoted string; specifically update the
expression that builds csv_data (the 30000.times.map call that appends to
csv_data) to use 30_000 and update the con.query call that invokes "SELECT *
FROM csv_io()" to use a single-quoted 'SELECT * FROM csv_io()' string to satisfy
RuboCop for csv_data, csv_io, db, and con.register_as_table/con.query usages.
- Around line 10-28: The method register_as_table assumes CSV rows are CSV::Row
(headers present) which causes NoMethodError otherwise; ensure headers are
enforced and iteration is robust: when creating CSV (CSV.new(io,
**csv_options)), set csv_options[:headers] = true if not provided, and update
the row/field handling in the DuckDB::TableFunction block (the csv.first.headers
and the loop using cell[1]) to treat each yielded cell as a [header,value] pair
consistently (or normalize to value via cell.is_a?(Array) ? cell[1] : cell),
referencing register_as_table, csv_options, CSV.new, csv.first.headers, and
output.set_value to locate the changes.
---
Duplicate comments:
In `@sample/issue930_benchmark.rb`:
- Around line 10-13: The method register_as_table_with_table_function currently
calls csv.first.headers which will raise if csv_options lacks headers: true;
update this method to ensure headers are enabled before accessing headers by
either merging a default (e.g. csv_options = {headers: true}.merge(csv_options))
when constructing CSV.new or explicitly checking csv_options[:headers] (or
csv.headers?) before calling csv.first.headers and handling the non-headers
case; refer to the register_as_table_with_table_function method, the csv
variable, and csv_options to locate and apply the guard/fix.
```ruby
execute("CREATE OR REPLACE TABLE #{name} (#{headers.map { |h| "#{h} VARCHAR" }.join(', ')})")
csv.each do |row|
  values = row.map { |cell| "'#{cell[1]}'" }.join(', ')
  execute("INSERT INTO #{name} VALUES (#{values})")
```
SQL injection on table name, column names, and cell values.
Three distinct injection surfaces in register_as_table_with_create_table:

- Line 37 — table identifier: #{name} is interpolated directly into both DDL and DML with no quoting.
- Line 37 — column identifiers: CSV header names are also interpolated unquoted.
- Lines 39-40 — values: '#{cell[1]}' wraps values in single quotes but applies no escaping. A CSV cell containing O'Brien produces a syntax error; a cell containing '; DROP TABLE csv_ct; -- executes arbitrary SQL.
Even in sample scripts, this pattern propagates dangerous habits. Quote identifiers and use parameterized queries or at least escape values:
🛡️ Proposed fix
```diff
 def register_as_table_with_create_table(name, io, csv_options: {})
   csv = CSV.new(io, **csv_options)
   headers = csv.first.headers
   csv.rewind
-  execute("CREATE OR REPLACE TABLE #{name} (#{headers.map { |h| "#{h} VARCHAR" }.join(', ')})")
+  quoted_name = "\"#{name.gsub('"', '""')}\""
+  col_defs = headers.map { |h| "\"#{h.gsub('"', '""')}\" VARCHAR" }.join(', ')
+  execute("CREATE OR REPLACE TABLE #{quoted_name} (#{col_defs})")
   csv.each do |row|
-    values = row.map { |cell| "'#{cell[1]}'" }.join(', ')
-    execute("INSERT INTO #{name} VALUES (#{values})")
+    placeholders = row.fields.map.with_index(1) { |_, i| "$#{i}" }.join(', ')
+    query("INSERT INTO #{quoted_name} VALUES (#{placeholders})", *row.fields)
   end
 end
```

Note: Verify whether DuckDB::Connection#query accepts positional bind parameters in this form for your ruby-duckdb version; if not, at minimum escape single quotes with cell[1]&.gsub("'", "''").
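The two escaping rules can be exercised in isolation; the helper names below are illustrative, not part of ruby-duckdb:

```ruby
# Illustrative helpers (names are not from ruby-duckdb) showing the two
# escaping rules: double embedded double-quotes in identifiers, and double
# embedded single-quotes in values when bind parameters are unavailable.
def quote_ident(name)
  %("#{name.gsub('"', '""')}")
end

def quote_value(value)
  value.nil? ? 'NULL' : "'#{value.gsub("'", "''")}'"
end

table  = quote_ident('csv_ct')
name   = quote_value("O'Brien")
attack = quote_value("'; DROP TABLE csv_ct; --")
```

With this, the O'Brien cell becomes a valid literal and the DROP TABLE payload stays inert inside a quoted string instead of terminating it.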
```ruby
start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.register_as_table_with_create_table('csv_ct', csv_io, csv_options: { headers: true })
con.query('SELECT * FROM csv_ct').to_a
```
Second benchmark silently depends on the first run's csv.rewind having fired.
The csv.rewind inside the else branch (line 26-27) is called only when DuckDB exhausts all rows with no LIMIT. If the query is ever changed to return a partial result, csv_io stays at EOF and register_as_table_with_create_table will process zero rows without any error. Reset the IO position explicitly between runs for robustness:
🛡️ Proposed fix

```diff
 end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+csv_io.rewind
+
 start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ruby
end_time = Process.clock_gettime(Process::CLOCK_MONOTONIC)
csv_io.rewind

start_time2 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
con.register_as_table_with_create_table('csv_ct', csv_io, csv_options: { headers: true })
con.query('SELECT * FROM csv_ct').to_a
```
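The failure mode being flagged — a StringIO left at EOF feeding a second CSV pass — is easy to reproduce with just the standard library:

```ruby
require 'csv'
require 'stringio'

io = StringIO.new("id,name\n1,Alice\n2,Bob")

first_pass = CSV.new(io, headers: true).read.count # consumes the stream
stale_pass = CSV.new(io, headers: true).read.count # io is at EOF: zero rows

io.rewind # explicit reset, as the review suggests
fresh_pass = CSV.new(io, headers: true).read.count
```

Without the rewind, the second registrar would silently build an empty table, which is exactly why the benchmark should reset the IO between runs rather than rely on the first run reaching EOF.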
sample/issue930.rb (Outdated)
```ruby
def register_as_table(name, io, csv_options: {})
  csv = CSV.new(io, **csv_options)
  headers = csv.first.headers
  csv.rewind
  columns = headers.each_with_object({}) { |header, hash| hash[header] = LogicalType::VARCHAR }
  tf = DuckDB::TableFunction.create(
    name:,
    columns:
  ) do |_func_info, output|
    line = csv.readline
    if line
      line.each_with_index do |cell, index|
        output.set_value(index, 0, cell[1])
      end
      1
    else
      0
    end
  end
```
headers: true is an undocumented, unenforced requirement that crashes without it.
Line 12 calls csv.first.headers: CSV#shift returns a CSV::Row when headers: true, but a plain Array otherwise. Array has no #headers method, so omitting headers: true from csv_options raises NoMethodError. The cell[1] access on line 22 also silently depends on this — iterating a CSV::Row yields [header, value] pairs, but iterating a plain row Array yields plain values, making cell[1] index into a string rather than fetch the field value.
Add a guard at the method boundary:
🛡️ Proposed fix
```diff
 def register_as_table(name, io, csv_options: {})
+  raise ArgumentError, 'csv_options must include headers: true' unless csv_options[:headers]
   csv = CSV.new(io, **csv_options)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```ruby
def register_as_table(name, io, csv_options: {})
  raise ArgumentError, 'csv_options must include headers: true' unless csv_options[:headers]
  csv = CSV.new(io, **csv_options)
  headers = csv.first.headers
  csv.rewind
  columns = headers.each_with_object({}) { |header, hash| hash[header] = LogicalType::VARCHAR }
  tf = DuckDB::TableFunction.create(
    name:,
    columns:
  ) do |_func_info, output|
    line = csv.readline
    if line
      line.each_with_index do |cell, index|
        output.set_value(index, 0, cell[1])
      end
      1
    else
      0
    end
  end
```
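The CSV::Row vs plain Array distinction behind this guard is observable directly with the standard library:

```ruby
require 'csv'
require 'stringio'

data = "id,name\n1,Alice"

headered = CSV.new(StringIO.new(data), headers: true).shift # first data row (CSV::Row)
plain    = CSV.new(StringIO.new(data)).shift                # header line itself (Array)

# CSV::Row iteration yields [header, value] pairs, so cell[1] is the field.
pairs = headered.map { |cell| cell }

# A plain Array row yields String cells; cell[1] indexes into the string.
first_cell = plain.first
```

So omitting headers: true both breaks csv.first.headers (Array has no #headers) and silently turns cell[1] into character indexing, which is why a boundary guard is safer than relying on the caller.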
🧰 Tools
🪛 GitHub Actions: Linter
[error] 10-10: Metrics/MethodLength: Method has too many lines. [19/10]
Actionable comments posted: 1
🧹 Nitpick comments (1)
sample/issue930.rb (1)
35-44: Database and connection are never explicitly closed. For a sample that others will use as a reference, the idiomatic block form of DuckDB::Database.open (which auto-closes via ensure) better demonstrates safe resource management.

♻️ Suggested idiomatic rewrite

```diff
-db = DuckDB::Database.open
-
-csv_data = "id,name,age\n1,Alice,30\n2,Bob,25\n3,Charlie,35"
-csv_data += 30_000.times.map { |i| "\n#{i + 4},Name#{i + 4},#{rand(0..100)}" }.join
-csv_io = StringIO.new(csv_data)
-
-con = db.connect
-con.query('SET threads=1')
-con.register_as_table('csv_io', csv_io, csv_options: { headers: true })
-result = con.query('SELECT * FROM csv_io()').to_a
-
-p result
-puts result.first.first == '1'
-puts result.first[1] == 'Alice'
+csv_data = "id,name,age\n1,Alice,30\n2,Bob,25\n3,Charlie,35"
+csv_data += 30_000.times.map { |i| "\n#{i + 4},Name#{i + 4},#{rand(0..100)}" }.join
+csv_io = StringIO.new(csv_data)
+
+DuckDB::Database.open do |db|
+  con = db.connect
+  con.query('SET threads=1')
+  con.register_as_table('csv_io', csv_io, csv_options: { headers: true })
+  result = con.query('SELECT * FROM csv_io()').to_a
+
+  p result
+  puts result.first.first == '1'
+  puts result.first[1] == 'Alice'
+end
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sample/issue930.rb` around lines 35 - 44, Replace the manual opens with the idiomatic block forms so the Database and Connection are auto-closed: wrap the existing logic that uses DuckDB::Database.open and con = db.connect inside DuckDB::Database.open do |db| ... end and db.connect do |con| ... end (keeping your con.query('SET threads=1'), con.register_as_table('csv_io', csv_io, ...), and the SELECT call inside) so both db and con are closed via ensure; alternatively, if you prefer not to use block form, explicitly call con.close and db.close after the query.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@sample/issue930.rb`:
- Around line 18-29: The closure captures and mutates the CSV object (csv) but
register_as_table is being used while DuckDB may be running with multiple
threads, causing concurrent invocations and races; update register_as_table to
assert the DuckDB thread count is 1 at runtime (e.g., query PRAGMA threads or
equivalent via the active connection and raise a clear error if >1), and add a
short inline comment/docs note near the register_as_table implementation
explaining the SET threads=1 precondition so callers know to set threads=1
before registering a mutable CSV-backed table.
---
Duplicate comments:
In `@sample/issue930.rb`:
- Line 12: The code calls csv.first.headers and later uses cell[1] assuming CSV
rows have headers, but there is no guard that csv was parsed with headers: true;
add a check in the method to detect/require header-enabled CSV parsing (e.g.,
inspect csv.first's type or csv_options) and raise or convert to headered rows
if missing, then use csv.first.headers and access fields with header-aware
indexing (cell[1]) safely; update the function that reads the CSV (the
invocation that uses csv.first.headers and later cell[1]) so it either enforces
csv_options[:headers] == true or transforms the plain Array rows into
CSV::Row-like headered rows before accessing headers and cell[1].
```ruby
) do |_func_info, output|
  line = csv.readline
  if line
    line.each_with_index do |cell, index|
      output.set_value(index, 0, cell[1])
    end
    1
  else
    csv.rewind
    0
  end
end
```
csv shared across threads — SET threads=1 is a required but unenforced precondition.
The csv object is captured in the closure and mutated on every block invocation. DuckDB's default thread count is greater than 1, so parallel scans will call this block concurrently, causing races on the shared CSV state (wrong rows, crashes). The demo script correctly calls SET threads=1 before registering the table, but register_as_table neither documents nor enforces this requirement.
At minimum, add a guard comment (or a runtime check):
🛡️ Suggested documentation / guard
```diff
 def register_as_table(name, io, csv_options: {}) # rubocop:disable Metrics/MethodLength, Metrics/AbcSize
+  # NOTE: The CSV IO object is not thread-safe. Ensure DuckDB is configured
+  # with SET threads=1 (or a thread-safe IO subclass) before calling this,
+  # otherwise concurrent table scans will produce data races.
   csv = CSV.new(io, **csv_options)
```

🤖 Prompt for AI Agents
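A runtime guard along the lines the review describes might look like this. assert_single_threaded! is a hypothetical helper; current_setting('threads') is a real DuckDB function, but the result-shape unwrapping is an assumption to verify against your ruby-duckdb version, so it is exercised here against a stub connection rather than a live database:

```ruby
# Hypothetical guard: refuse to register a mutable CSV-backed table unless
# DuckDB is running single-threaded. The to_a.first.first unwrapping is an
# assumption about the query result shape.
def assert_single_threaded!(con)
  threads = con.query("SELECT current_setting('threads')").to_a.first.first.to_i
  return if threads <= 1

  raise "SET threads=1 before registering a CSV-backed table (threads=#{threads})"
end

# Stub standing in for a DuckDB connection, so the guard can be exercised
# without the gem installed.
StubConnection = Struct.new(:threads) do
  def query(_sql)
    [[threads]]
  end
end

assert_single_threaded!(StubConnection.new('1')) # passes silently
```

Registering the table function right after this assertion would make the SET threads=1 precondition fail fast instead of manifesting as data races.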
Summary by CodeRabbit

- New Features
- Samples