Fix XMLSyntaxError "Namespace prefix mailto for dade on a is not defined" in convert_html_to_xml step 30 #160
Draft
…rees Agent-Logs-Url: https://github.com/scieloorg/scielo_migration/sessions/807d21a0-f07d-4b7f-9102-fcce21174c55 Co-authored-by: robertatakenaka <505143+robertatakenaka@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix XMLSyntaxError during HTML to XML conversion" to "Fix XMLSyntaxError "Namespace prefix mailto for dade on a is not defined" in convert_html_to_xml step 30" on Apr 28, 2026
What does this PR do?
Fixes the `XMLSyntaxError: Namespace prefix mailto for dade on a is not defined` that aborts `convert_html_to_xml_step_30_embed_html` (and cascades to `XML Status BLOCKED` in the Article Proc) when the source HTML carries malformed attributes with an undeclared namespace prefix, e.g. `<a mailto:dade="...">`.

- `remove_invalid_namespace_attributes(tree)` in `scielo_classic_website/htmlbody/html_fixer.py`: walks the lxml tree and removes attributes whose name contains `:` but whose prefix is not declared. It preserves `xml`/`xlink` and attributes already mapped in Clark notation (`{uri}localname`).
- `load_html` and `get_fixed_html` call it, covering both paths used by `HTMLContent` → `html_to_node` → `MainHTMLPipe`.
- `TestRemoveInvalidNamespaceAttributes` in `tests/test_html_fixer.py` (7 cases, including the scenario in which the attribute value contains `>`, a case where the textual cleanup `remove_namespaces_from_content` fails).

Where should the review start?
`scielo_classic_website/htmlbody/html_fixer.py`: the function `remove_invalid_namespace_attributes` and its calls in `load_html`/`get_fixed_html`.

How can this be tested manually?
Or run `pytest tests/test_html_fixer.py`.

Any context you would like to provide?
`lxml.html.fromstring` accepts `mailto:dade` as a literal attribute name. When the parsed `<body>` is inserted into the article XML and re-serialized by `EndPipe`, the `StartPipe` of step 30 calls `ET.fromstring(...)` (a strict XML parser), which interprets `mailto:` as a nonexistent namespace prefix and raises `XMLSyntaxError`. The existing textual cleanup (`remove_namespaces_from_content`) is fragile: it fails when the attribute value contains `>` or quotes. For this reason the sanitization is done at the tree level, after parsing.

Screenshots
N/A.
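
The failure mode described in the context section can be reproduced in isolation. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for the lxml parser used in the pipeline (an assumption made so that the snippet runs without third-party packages), and the sample markup is hypothetical. The wording of the error differs from lxml's, but the cause is the same: a strict XML parser treats `mailto:` as an undeclared namespace prefix.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, as a lenient HTML parser might re-serialize it.
serialized = '<body><a mailto:dade="x" href="#">contact</a></body>'

try:
    ET.fromstring(serialized)
    error = None
except ET.ParseError as exc:
    # The strict parser rejects the undeclared "mailto" prefix.
    error = str(exc)

# The same kind of attribute parses fine once its prefix is declared.
declared = ET.fromstring(
    '<body xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<a xlink:href="#">contact</a></body>'
)

print(error)
print(declared.find("a").attrib)
```

In the second document the attribute name comes back in Clark notation (`{uri}localname`), which is the form the PR description says is preserved.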
What are the relevant tickets?
Issue reported in scieloorg/scielo_migration regarding the task `migrate_and_publish_articles` with `XML Status BLOCKED` and traceback `lxml.etree.XMLSyntaxError: Namespace prefix mailto for dade on a is not defined`.

References
… `remove_ms_office_conditionals`) and PR "Fix XMLSyntaxError caused by invalid HTML comments (clipboard artifacts) in HTML→XML conversion" #153 (`remove_invalid_xml_comments`).
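
For readers who want a feel for tree-level sanitization, here is a minimal sketch of the idea. It is not the PR's `remove_invalid_namespace_attributes`: the helper name `drop_undeclared_prefixed_attributes`, the `KEPT_PREFIXES` set, and the hand-built tree are all assumptions for illustration, and the standard library's `xml.etree.ElementTree` stands in for lxml so the snippet runs without extra dependencies. The attribute value deliberately contains `>`, the case the description says defeats the textual cleanup.

```python
import xml.etree.ElementTree as ET

# Prefixes kept even without a declaration; this mirrors the xml/xlink
# exception in the PR description (the exact set here is an assumption).
KEPT_PREFIXES = frozenset({"xml", "xlink"})


def drop_undeclared_prefixed_attributes(root, kept_prefixes=KEPT_PREFIXES):
    """Remove attributes such as mailto:dade; return the removed names.

    Hypothetical helper written for this example only.
    """
    removed = []
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comments and processing instructions
        for name in list(element.attrib):
            if name.startswith("{"):
                continue  # Clark notation: already bound to a namespace URI
            prefix, colon, _local = name.partition(":")
            if colon and prefix not in kept_prefixes:
                del element.attrib[name]
                removed.append(name)
    return removed


# Built by hand to imitate what a lenient HTML parser would produce.
body = ET.Element("body")
link = ET.SubElement(
    body, "a", {"mailto:dade": "a > b", "href": "#", "xml:lang": "pt"}
)
link.text = "contact"

removed = drop_undeclared_prefixed_attributes(body)

# After the cleanup, the serialized tree survives a strict re-parse.
reparsed = ET.fromstring(ET.tostring(body, encoding="unicode"))
print(removed, reparsed.find("a").get("href"))
```

Because the check works on attribute names in the parsed tree, the content of the attribute value (including `>` or quotes) never affects what gets removed.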