
Conversation

@Megha18jain
Contributor

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Megha18jain, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes an automated and robust pipeline for ingesting NOAA storm event data. It covers the entire lifecycle: fetching raw data via FTP, organizing it into smaller, processable units, and cleaning and transforming it into a structured format. The inclusion of a manifest file ensures that this critical data source can be regularly updated, providing timely and accurate storm information.

Highlights

  • New Data Import Pipeline: Introduces a complete pipeline for importing NOAA storm event data, encompassing download, sharding, and processing stages.
  • Automated Data Download: A new Python script (download_script.py) is added to automatically download and unzip NOAA storm data from an FTP server, featuring retry mechanisms for robust downloads (a rough sketch of such a retry loop follows this list).
  • Data Sharding for Efficiency: A sharding script (shard_noaa_data.py) is included to break down large raw data files into smaller, manageable shards based on STATE_FIPS, optimizing subsequent processing.
  • Data Cleaning and Transformation: The process_noaa_data.py script handles the cleaning, transformation, and geoid resolution of the sharded data, preparing it for import. It also retrieves an API key from GCS for geocoding.
  • Automated Refresh Schedule: A manifest.json file is added, defining the import specifications for the Storm_NOAA data and setting up a weekly cron schedule for automatic data refreshes.
  • Schema Definition: A new template MCF file (storm_noaa.tmcf) is provided to define the structured schema for storm episodes and events, including various properties like injuries, damages, and locations.
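To make the retry idea in the download bullet concrete, here is a minimal, hedged sketch of a download-with-retry helper. It is not the actual contents of download_script.py; the helper name, retry count, and backoff scheme are assumptions, and urllib is used purely for illustration.

import gzip
import shutil
import time
import urllib.request

def download_with_retries(url, dest_gz_path, max_retries=3, backoff_seconds=5):
    """Hypothetical helper: download url to dest_gz_path with retries, then unzip it."""
    for attempt in range(1, max_retries + 1):
        try:
            urllib.request.urlretrieve(url, dest_gz_path)
            break                                   # download succeeded
        except OSError:
            if attempt == max_retries:
                raise                               # give up after the final attempt
            time.sleep(backoff_seconds * attempt)   # simple linear backoff before retrying
    # Decompress the downloaded .gz next to it, dropping the .gz extension.
    with gzip.open(dest_gz_path, 'rb') as src, open(dest_gz_path[:-3], 'wb') as dst:
        shutil.copyfileobj(src, dst)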
Using Gemini Code Assist

The full guide for Gemini Code Assist is available on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new data import pipeline for NOAA storm data, including scripts for downloading, sharding, and processing. The implementation is comprehensive and includes good features like parallel processing and download retries. However, I've identified several critical issues related to data correctness and schema mapping, such as typos in the TMCF and CSV header, a bug that would cause crop_damage to be stored as a tuple, and an issue preventing the script from correctly skipping already downloaded files. Addressing these is crucial for the pipeline to function correctly. I've also included some suggestions to improve code robustness and maintainability.

local_gz_path = os.path.join(data_dir, filename)
local_csv_path = local_gz_path[:-3] # Remove .gz extension

if local_csv_path in existing_files:

Severity: high

The check for existing files is incorrect. local_csv_path is a full file path (e.g., noaa_storm_data/some_file.csv), while existing_files contains only file basenames (e.g., {'some_file.csv'}). The in check will therefore always evaluate to false, causing files to be re-downloaded on every run. You should check against the filename without the .gz extension.

Suggested change
if local_csv_path in existing_files:
if filename[:-3] in existing_files:
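A self-contained illustration of the mismatch, reusing the example names from the comment above (hypothetical values, not the script's actual data):

import os

data_dir = "noaa_storm_data"
filename = "some_file.csv.gz"
existing_files = {"some_file.csv"}                          # basenames, as os.listdir() would return

local_gz_path = os.path.join(data_dir, filename)            # 'noaa_storm_data/some_file.csv.gz'
local_csv_path = local_gz_path[:-3]                         # 'noaa_storm_data/some_file.csv'

print(local_csv_path in existing_files)                     # False: full path compared against basenames
print(os.path.basename(local_csv_path) in existing_files)   # True: the check the code intends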

Node: E:storm_noaa->E0
dcid: C:storm_noaa->stormEpisode
typeOf: dcs:StormEpisode
name: C:storm_noaa->stromEpisodeName

Severity: high

There is a typo in the column name: stromEpisodeName should be stormEpisodeName. This will cause a mismatch with the data and prevent correct schema mapping.

name: C:storm_noaa->stormEpisodeName

Comment on lines +14 to +15
indirectInjuries C:storm_noaa->indirectInjuries
directDeaths C:storm_noaa->directDeaths

Severity: high

These lines are missing a colon (:) between the property and the column mapping. This will cause a parsing error when processing the TMCF file.

indirectInjuries: C:storm_noaa->indirectInjuries
directDeaths: C:storm_noaa->directDeaths

maxClassification: C:storm_noaa->maxClassification
lengthTraveled: C:storm_noaa->lengthTraveled
width: C:storm_noaa->width
lengthTraveled: C:storm_noaa->lengthTraveled

Severity: high

The property lengthTraveled is duplicated. It is already defined on line 28. Please remove this redundant line.

startLocation, endLocation = start_loc_str, end_loc_str

property_damage = cost_to_int(row.get("DAMAGE_PROPERTY", "0"))
crop_damage = cost_to_int(row.get("DAMAGE_CROPS", "0")),

Severity: high

A trailing comma at the end of this line causes crop_damage to be a tuple (e.g., (1000,)) instead of an integer. This will result in incorrectly formatted data in the output CSV. Please remove the comma.

Suggested change
crop_damage = cost_to_int(row.get("DAMAGE_CROPS", "0")),
crop_damage = cost_to_int(row.get("DAMAGE_CROPS", "0"))
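For reference, a trailing comma after an expression turns it into a one-element tuple in Python, which is why the output cell would be malformed:

crop_damage = 1000,                       # trailing comma: crop_damage is the tuple (1000,)
print(f"[USDollar {crop_damage}]")        # [USDollar (1000,)] -- malformed value
crop_damage = 1000                        # without the comma: a plain integer
print(f"[USDollar {crop_damage}]")        # [USDollar 1000] -- the intended value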

property_damage_value = f"[USDollar {property_damage}]" if property_damage else ""
crop_damage_value = f"[USDollar {crop_damage}]" if crop_damage else ""

unit = "MilesPerHour" if row.get("MAGNitude_TYPE") else "Inch"

Severity: high

There is a typo in the dictionary key: MAGNitude_TYPE should be MAGNITUDE_TYPE. This will cause row.get("MAGNitude_TYPE") to always return None, leading to the unit variable being incorrectly assigned in many cases.

Suggested change
unit = "MilesPerHour" if row.get("MAGNitude_TYPE") else "Inch"
unit = "MilesPerHour" if row.get("MAGNITUDE_TYPE") else "Inch"

Comment on lines +302 to +308
try:
    # Check if temp dir is empty before trying to remove it
    if not os.listdir(FLAGS.temp_dir):
        os.rmdir(FLAGS.temp_dir)
        logging.info(f"Successfully removed temporary directory {FLAGS.temp_dir}")
except OSError as e:
    logging.error(f"Error removing temporary directory {FLAGS.temp_dir}: {e}")

Severity: medium

The logic for cleaning up the temporary directory can be simplified and made more robust. Instead of checking if the directory is empty before removing it (which is prone to race conditions), you can use shutil.rmtree to recursively delete the directory. This is safer and handles cases where some temporary files might not have been deleted due to errors.

Suggested change
try:
    # Check if temp dir is empty before trying to remove it
    if not os.listdir(FLAGS.temp_dir):
        os.rmdir(FLAGS.temp_dir)
        logging.info(f"Successfully removed temporary directory {FLAGS.temp_dir}")
except OSError as e:
    logging.error(f"Error removing temporary directory {FLAGS.temp_dir}: {e}")
try:
    shutil.rmtree(FLAGS.temp_dir)
    logging.info(f"Successfully removed temporary directory {FLAGS.temp_dir}")
except OSError as e:
    logging.error(f"Error removing temporary directory {FLAGS.temp_dir}: {e}")

Comment on lines +65 to +72
if shard_key not in shard_files:
    shard_files[shard_key] = open(out_path, 'w', newline='')
    writer = csv.writer(shard_files[shard_key])
    writer.writerow(header)
    shard_files[shard_key].writer = writer # Attach writer to file handle
    logging.info(f"Created shard file: {out_path}")

shard_files[shard_key].writer.writerow(row)

Severity: medium

Attaching a writer attribute to a file object is an unconventional pattern that can be confusing for future maintenance. A more standard and clearer approach would be to store a tuple of (file_handle, writer) in your dictionary. This makes the code easier to understand and maintain.

You would also need to update how you close the files later on line 78 to something like for f, _ in shard_files.values(): f.close().

Suggested change
if shard_key not in shard_files:
    shard_files[shard_key] = open(out_path, 'w', newline='')
    writer = csv.writer(shard_files[shard_key])
    writer.writerow(header)
    shard_files[shard_key].writer = writer # Attach writer to file handle
    logging.info(f"Created shard file: {out_path}")
shard_files[shard_key].writer.writerow(row)
if shard_key not in shard_files:
    outfile = open(out_path, 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(header)
    shard_files[shard_key] = (outfile, writer)
    logging.info(f"Created shard file: {out_path}")
shard_files[shard_key][1].writerow(row)
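Putting the suggestion together with the close-out loop mentioned above, here is a hedged, self-contained sketch of the (file_handle, writer) pattern; the function names and arguments are assumptions, not the actual shard_noaa_data.py code:

import csv

shard_files = {}  # shard_key -> (file handle, csv writer)

def write_shard_row(shard_key, out_path, header, row):
    """Lazily open one shard file per key, write the header once, then append the row."""
    if shard_key not in shard_files:
        outfile = open(out_path, 'w', newline='')
        writer = csv.writer(outfile)
        writer.writerow(header)
        shard_files[shard_key] = (outfile, writer)
    shard_files[shard_key][1].writerow(row)

def close_shards():
    """Close every shard file handle once all rows have been written."""
    for outfile, _ in shard_files.values():
        outfile.close()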
