Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugs import adds duplicate entries in taxa tree master #52

Open
roger-mahler opened this issue Feb 16, 2023 · 9 comments
Open

Bugs import adds duplicate entries in taxa tree master #52

roger-mahler opened this issue Feb 16, 2023 · 9 comments
Assignees
Labels
bug Something isn't working data-error Incorrect data in SEAD quality-control Quality control related task

Comments

@roger-mahler
Copy link
Collaborator

roger-mahler commented Feb 16, 2023

Duplicates can be identified by this query:

with duplicates as (
	select distinct author_name, genus_name, species
	from tbl_taxa_tree_master
	join tbl_taxa_tree_genera using (genus_id)
	join tbl_taxa_tree_authors using (author_id)
	group by author_name, genus_name, species
	having count(*) > 1
), taxa as (
	select taxon_id, genus_id, author_id, author_name, genus_name, species
	from tbl_taxa_tree_master
	join tbl_taxa_tree_genera using (genus_id)
	join tbl_taxa_tree_authors using (author_id)
)
	select taxon_id, author_name, genus_name, species, bt.*
	from duplicates
	join taxa using ( author_name, genus_name, species)
	left join bugs_import.bugs_trace as bt
	  on bt.sead_table = 'tbl_taxa_tree_master'
	 and bt.sead_reference_id = taxon_id
	 -- and bt.bugs_table = 'INDEX'
	order by taxon_id
Bugs import Records Comment
Clean (no bugs import) 0 duplicates
sead_production_201912 78 duplicates After import of bugsdata_20190303
sead_production_201912 1636 duplicates After incremental import
@roger-mahler roger-mahler self-assigned this Feb 16, 2023
@roger-mahler roger-mahler added bug Something isn't working data-error Incorrect data in SEAD quality-control Quality control related task labels Feb 16, 2023
@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 16, 2023

Sample duplicate data from sead_production_201912:
author_name = '(L.)', genus_name = 'Elaphrus', species = 'riparius'

select taxon_id, t.date_updated, author_id, genus_id, author_name, genus_name, species, bt.bugs_trace_id
from tbl_taxa_tree_master t
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
left join bugs_import.bugs_trace as bt
  on bt.sead_table = 'tbl_taxa_tree_master'
 and bt.sead_reference_id = taxon_id
where author_name = '(L.)'
  and genus_name = 'Elaphrus'
  and species = 'riparius'
taxon_id date_updated author_id genus_id bugs_trace_id comment
28965 2014-04-17 06:02:04 3103 12656 existing record
40204 2019-12-20 23:26:18 3103 12656 150283 added by import

Imported data seems to be assigned to existing taxon:

select *
from tbl_abundances
join tbl_analysis_entities using (analysis_entity_id)
join tbl_datasets using (dataset_id)
where taxon_id in (28965, 40204)

No item assigned to 40204.

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 17, 2023

Same data from BugsCEP:

select *
from "INDEX"
where "AUTHORITY" = '(L.)'
  and "GENUS" = 'Elaphrus'
  and "SPECIES" = 'riparius'
CODE FAMILY GENUS SPECIES AUTHORITY
1.0120030000000000 CARABIDAE Elaphrus riparius (L.)

This specie has a synonym:

select *
from "TSynonym"
where "SynAuthority" = '(L.)'
  and "SynGenus" = 'Elaphrus'
  and "SynSpecies" = 'riparius'
CODE SynGenus SynSpecies SynAuthority Ref Notes SynonymCODE
1.0120045000000000 Elaphrus riparius (L.) Lindroth 1985 regards this as a synonym. 31
select *
from "INDEX"
where "CODE" in ('1.0120030000000000', '1.0120045000000000')
CODE FAMILY GENUS SPECIES AUTHORITY
1.0120030000000000 CARABIDAE Elaphrus riparius (L.)
1.0120045000000000 CARABIDAE Elaphrus tuberculatus Mäklin

Motsvarande poster i sead_production:

select taxon_id, family_name, genus_name, species, author_name, taxonomic_code 
from tbl_taxa_tree_master
join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where taxonomic_code in (1.0120030000, 1.0120045000)
taxon_id family_name genus_name species author_name taxonomic_code
28965 CARABIDAE Elaphrus riparius (L.) 1.0120030000
28966 CARABIDAE Elaphrus tuberculatus Mäklin 1.0120045000
select taxon_id, family_name, genus_name, species, author_name, taxonomic_code 
from tbl_taxa_tree_master
left join tbl_taxonomic_order using (taxon_id)
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
join tbl_taxa_tree_families using (family_id)
where family_name = 'CARABIDAE'
  and genus_name = 'Elaphrus'
  and author_name in ('(L.)', 'Mäklin')
taxon_id family_name genus_name species author_name taxonomic_code
28965 CARABIDAE Elaphrus riparius (L.) 1.0120030000
28966 CARABIDAE Elaphrus tuberculatus Mäklin 1.0120045000
40204 CARABIDAE Elaphrus riparius (L.)
select bugs_table, sead_table, sead_reference_id, bugs_data, manipulation_type
from bugs_import.bugs_trace
where translated_compressed_data like '%Elaphrus%'
  and bugs_table = 'TSynonym' 
bugs_table sead_table sead_reference_id bugs_data manipulation_type
TSynonym tbl_taxa_tree_master 40204 {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} INSERT
TSynonym tbl_species_associations 2650 {1.0120045,Elaphrus,riparius,(L.),Lindroth 1985,regards this as a synonym.} INSERT

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 19, 2023

The bug can be reproduced with the following small BugsCEP database. The data is created using Jailer for sample SAMP000546, with a few data points from TFossil. A record with CODE 1.0120045 has been added to INDEX and TSynonym (these two records might be sufficient to reproduce the error).

insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('23.028001499999998', 'STAPHYLINIDAE', 'Eucnecosum', 'brachypterum (grp)', '(Grav.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('93.015056999999999', 'CURCULIONIDAE', 'Otiorhynchus', 'nodosus', '(Möll.)');
insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values ('40.002102000000001', 'SCIRTIDAE', 'Ptilodactyla', 'exotica', 'Chapin');

insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bell & Walker (2005)', 'Bell, M. & Walker M.J.C. (1992)', 'Late Quaternary Environmental Change - Physical and Human Perspectives (Second Edition). Longman, Essex.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Bohme 2005', 'Böhme, J. (2005)', 'Die Köfer Mitteleuropas. K. Katalog (Faunistiche Übersicht) (2nd ed.). Spektrum Academic, Munich.', '(revised version of Lucht 1987)');
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Morris 2006', 'Morris, M. (2006)', 'Checklist of beetles of the British Isles, Curculionidae. <www.coleopterist.org.uk/curculionidae-list.htm>.', NULL);
insert into "TBiblio" ("REFERENCE", "AUTHOR", "TITLE", "Notes") values ('Strand 1946', 'Strand, A. (1946)', 'Nord Norges Coleoptera. Tromsö Museums Arshefter, Naturhistorisk Avd. Nr. 34, 67(1). (629pp.)', NULL);

insert into "TSite" ("SiteCODE", "SiteName", "Region", "Country", "NGR", "LatDD", "LongDD", "Alt", "IDBy", "Interp", "Specimens") values ('SITE000253', 'Håkulls Mosse, Kullaberg', 'Skåne', 'Sweden', NULL, 56.2999992, 12.5333338, 125, 'Lemdahl', 'Kullen Peninsula, see also Björkeröds mosse.', NULL);

insert into "TCountsheet" ("CountsheetCODE", "CountsheetName", "SiteCODE", "SheetContext", "SheetType") values ('COUN000144', 'Hakullsmosse_bugsdata.XLS', 'SITE000253', 'Stratigraphic sequence', 'Abundances');

insert into "TSample" ("SampleCODE", "SiteCODE", "X", "Y", "ZorDepthTop", "ZorDepthBot", "RefNrContext", "CountsheetCODE") values ('SAMP000546', 'SITE000253', NULL, NULL, NULL, NULL, 'B8:6/6', 'COUN000144');

insert into "TDatesMethods" ("Abbrev", "Method", "Type", "SortOrder") values ('GeolPer', 'Geological period', 'Period', 2);

insert into "TDatesPeriod" ("PeriodDateCODE", "SampleCODE", "Uncertainty", "PeriodCODE", "DatingMethod", "Notes") values ('PERI005175', 'SAMP000546', NULL, 'LG', 'GeolPer', NULL);

insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS014299', '93.015056999999999', 'SAMP000546', 3);
insert into "TFossil" ("FossilBugsCODE", "CODE", "SampleCODE", "Abundance") values ('FOSS144182', '23.028001499999998', 'SAMP000546', 30);

insert into "TLookupCountsheetContext" ("SheetContext", "SortOrder") values ('Stratigraphic sequence', 3);

insert into "TLookupCountsheetTypes" ("CountsheetType", "SortOrder") values ('Abundances', 1);

insert into "TPeriods" ("PeriodCODE", "PeriodName", "PeriodType", "PeriodDesc", "PeriodRef", "PeriodGeog", "Begin", "BeginBCAD", "End", "EndBCAD", "YearsType") values ('LG', 'Lateglacial', 'Geological', 'Cold period after the last Glaciation. Pollen Zones I-III', 'Bell & Walker (2005)', 'Europe', 13500, 'BP', 10000, 'BP', 'C14');

insert into "TSiteOtherProxies" ("OtherProxyID", "SiteCODE", "HasPollen", "HasPlantMacro", "HasDiatoms", "HasChironomids", "HasSoilChemistry", "HasIsotopes", "HasAnimalBones", "HasArchaeology", "HasMolluscs") values (1, 'SITE000253', 1, 0, 0, 0, 0, 0, 0, 0, 0);

insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'dubius', '(Ström.)', 'Bohme 2005', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('93.015056999999999', 'Otiorhynchus', 'maurus', '(Gyllenhal) non (Marsham)', 'Morris 2006', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('23.028001499999998', 'Arpedium', NULL, NULL, 'Strand 1946', NULL);
insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values ('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');

insert into "TTaxoNotes" ("CODE", "Ref", "Data") values ('40.002102000000001', 'Bohme 2005', 'Genus not listed.');

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 20, 2023

This is a minimal BugsCEP database that reproduces the error:

insert into "INDEX" ("CODE", "FAMILY", "GENUS", "SPECIES", "AUTHORITY") values
    ('1.0120045000000000', 'CARABIDAE', 'Elaphrus', 'tuberculatus', 'Mäklin');
    ('1.0120030000000000', 'CARABIDAE', 'Elaphrus', 'riparius', '(L.)');

insert into "TSynonym" ("CODE", "SynGenus", "SynSpecies", "SynAuthority", "Ref", "Notes") values
    ('1.0120045000000000', 'Elaphrus', 'riparius', '(L.)', 'Lindroth 1985', 'regards this as a synonym.');

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 21, 2023

Import INDEX items

The mapping between items in BugsCEP INDEX and SEAD taxon_id is kept in table tbl_taxonomic_order i.e. the first step is to find taxon_id where taxonomic_code equals INDEX.CODE.

The items already exists in the SEAD database, so no new items are inserted.

select *
from tbl_taxonomic_order
where taxonomic_code in (1.0120030000, 1.0120045000)
taxonomic_order_id date_updated taxon_id taxonomic_code taxonomic_order_system_id
20245 2014-04-17 06:02:04 28965 1.0120030000 1
20246 2014-04-17 06:02:04 28966 1.0120045000 1

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 21, 2023

Import Synonym items

Each BugsCEP synonym item should have a corresponding record in SEAD table tbl_species_associations. The minimal test database has the following synonym item:

CODE SynGenus SynSpecies SynAuthority Ref Notes
1.01200450 Elaphrus riparius (L.) Lindroth 1985 regards this as a synonym.

If a synonym is previously imported (exists in the import trace) then nothing needs to be done. Otherwise a new species association is created comprising of:

  • a new or existing source species (taxon_id)
  • a new or existing target species (associated_taxon_id)
  • a reference (biblio_id)
  • an association type

The target species (tbl_taxa_tree_master ) is found by searching for synonym's CODE in tbl_taxa_tree_order. The source species (tbl_taxa_tree_master ) is found by searching for synonym's species, genus and author in species repository (tbl_taxa_tree_master ).

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 21, 2023

Find or create association's target species

Fetch target species (tbl_taxa_tree_master ) using taxonomic_code in tbl_taxa_tree_order. This item should always exists - if not there is an inconsistency in the BugsCEP database.

select taxon_id
from tbl_taxonomic_order
where taxonomic_code in (1.0120045000)
taxonomic_order_id date_updated taxon_id taxonomic_code taxonomic_order_system_id
20246 2014-04-17 06:02:04 28966 1.0120045000 1

private void setTargetSpecies(){
TaxaSpecies targetSpecies = taxonomicOrderRepository.findBugsSpeciesByCode(BigDecimalDefinition.convertToSeadCode(synonym.getCode()));
if(targetSpecies == null){
seadData.addError("No species found for code");
return;
}
seadData.setTargetSpecies(targetSpecies);
}

select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where taxon_id = 28966    
taxon_id author_name genus_name species
28966 Mäklin Elaphrus tuberculatus

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 21, 2023

Find or create association's source species

Finding the source species is basically a simple search in SEAD of the given species name (riparius), genus name (Elaphrus) and authority name (Mäklin).

select taxon_id, author_name, genus_name, species
from tbl_taxa_tree_master
join tbl_taxa_tree_genera using (genus_id)
join tbl_taxa_tree_authors using (author_id)
where species = 'riparius'
  and genus_name = 'Elaphrus'
  and author_name = '(L.)'

Note that this species exists in the SEAD database. The bugs import, however, uses the authority name given by the target species ((L.)) instead of the authority given by the synonym item (Mäklin). This can be seen on line 149 in the following code snippet.

private TaxaSpecies getOrCreate(TaxaGenus synonymGenus, Synonym synonym, TaxaSpecies targetSpecies){
String name = synonym.getSynSpecies() != null ? synonym.getSynSpecies() : targetSpecies.getSpeciesName();
String authorName = targetSpecies.getAuthor() != null ? targetSpecies.getAuthor().getAuthorName() : null;
TaxaSpecies found = speciesRepository.findBySpeciesNameAndGenusGenusNameAndAuthorAuthorName(name, synonymGenus.getGenusName(), authorName);
if(found == null || !authorityManager.authorityMatches(found, targetSpecies)){
return createBaseSynonymSpecies(synonymGenus, synonym, targetSpecies);
} else {
return found;
}
}

Since the combination (riparius, Elaphrus, (L.)) is missing in the database the system adds a new species.This species is however assigned the authority of the synonym (Mäklin), and not the authority used when searching for an existing species ((L.)).

This results in a duplicate species record (riparius, Elaphrus, Mäklin) in the database.

@roger-mahler
Copy link
Collaborator Author

roger-mahler commented Feb 21, 2023

We have two options to correct the issue:

  1. When searching for existing source species the system should use the same autority as specified in the synonym (SynAuthor).
  2. When creating a new source species then it should be assigned the same authority as the target species.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working data-error Incorrect data in SEAD quality-control Quality control related task
Projects
None yet
Development

No branches or pull requests

2 participants