Calculate difference - label shuffling error #8

Open
dBenedek opened this issue May 4, 2021 · 1 comment


dBenedek commented May 4, 2021

I encountered the following error when running calculate_difference() with label shuffling:
Error in ecdf(shuffled[i, ]) : 'x' must have 1 or more non-missing values

It turned out that rows (genes) containing only 0 values caused the issue.
It would be nice to have a solution or recommendation for handling this problem (e.g. simply dropping genes with only 0s before the difference calculation, or something else).
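
The failure can be reproduced in isolation. The following is a minimal sketch, assuming that the fold change is the log2 ratio of two group means; this is an assumption about the internals of calculate_difference(), which are not shown in this issue, and all variable names are hypothetical:

```r
# Hypothetical all-zero gene: both group means are 0.
control_mean <- mean(c(0, 0, 0))
case_mean <- mean(c(0, 0, 0))

# 0 / 0 is NaN, so the log2 fold change is NaN too.
log2_fc <- log2(case_mean / control_mean)

# Assumed here: every shuffled fold change for this row is NaN as well.
shuffled_row <- rep(NaN, 5)

# ecdf() drops missing values first and stops when nothing is left.
ecdf_error <- tryCatch(
  {
    ecdf(shuffled_row)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(ecdf_error)
```

If this is indeed what happens, an all-zero row can never produce a usable empirical distribution, which would explain why dropping such rows beforehand helps.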

Another issue related to this problem: after calculating diversity, some rows contain a few very low values (below .Machine$double.eps) that are treated as 0s, so these rows also cause errors:

Error in if (ecdf(shuffled[i, ])(log2_fc[i]) >= 0.5) { : 
  missing value where TRUE/FALSE needed
Calls: calculate_difference -> label_shuffling
In addition: There were 50 or more warnings (use warnings() to see the first 50)
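
One way this second message can arise is sketched below with made-up numbers: the ECDF can be built, but evaluating it at a missing fold change returns a missing value, and if() cannot branch on that. Whether this is the exact path taken inside label_shuffling() is an assumption; the variable names are hypothetical.

```r
# Made-up shuffled fold changes for one row; one of them is missing.
shuffled_row <- c(-0.4, 0.1, NaN, 0.7)

# The ECDF can still be built from the three usable values.
row_ecdf <- ecdf(shuffled_row)

# Evaluating it at a missing observed fold change yields a missing value.
observed_fc <- NaN
p <- row_ecdf(observed_fc)

# if() cannot branch on a missing condition, which reproduces the error.
if_error <- tryCatch(
  {
    if (p >= 0.5) "upper" else "lower"
  },
  error = function(e) conditionMessage(e)
)
print(if_error)

# A guard that checks is.na() first avoids the failure.
side <- if (is.na(p)) NA_character_ else if (p >= 0.5) "upper" else "lower"
```

A guard like the last line would let such rows return NA instead of stopping the whole run, though whether that is the desired behavior is a design decision for the maintainers.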

I avoided this issue with some pre-filtering:

library(dplyr)

# Convert really small values to 0s:
diversity_data[.Machine$double.eps > diversity_data] <- 0

# Filter out rows (genes) with only zeros:
diversity_data_filtered <- diversity_data %>% 
  mutate(rowsum = rowSums(dplyr::select(., starts_with("dataset")))) %>% 
  filter(rowsum != 0) %>% 
  dplyr::select(-rowsum)
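
For reference, the same pre-filtering idea can be sketched in base R on a toy data frame. The data and gene names below are made up, and the sketch deliberately differs from the snippet above in one respect: it ignores NAs when summing (na.rm = TRUE), so a row is dropped only when all of its non-missing values are 0.

```r
# Toy data (made up): one ID column plus three "dataset" sample columns.
diversity_data <- data.frame(
  ID = c("geneA", "geneB", "geneC"),
  dataset1_s1 = c(0.5, 0, 1e-20),
  dataset1_s2 = c(0.2, 0, 0),
  dataset1_s3 = c(NA, 0, 0)
)

# Work only on the numeric sample columns, leaving the ID column untouched.
value_cols <- grep("^dataset", names(diversity_data))
values <- as.matrix(diversity_data[, value_cols])

# Convert really small values to 0s, skipping NAs explicitly.
values[!is.na(values) & values < .Machine$double.eps] <- 0
diversity_data[, value_cols] <- values

# Keep rows whose non-missing values do not sum to 0.
keep <- rowSums(values, na.rm = TRUE) != 0
diversity_data_filtered <- diversity_data[keep, ]
print(diversity_data_filtered)
```

Restricting the comparison to the sample columns avoids comparing the character ID column against a number.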

esebesty self-assigned this on May 4, 2021
esebesty added the bug and documentation labels on May 4, 2021

dBenedek commented May 9, 2021

I also get the error message for a row with 2 unique values and some additional 0s and NAs:

> diversity_data_filtered[404,]

                    ID dataset3_SRX4143134 dataset3_SRX4143118 dataset3_SRX4143107 dataset3_SRX4143157
404 ENSG00000030304.14                  NA          0.03355015                  NA                  NA
    dataset3_SRX4143185 dataset3_SRX4143142 dataset3_SRX4143119 dataset3_SRX4143193 dataset3_SRX4143122
404                   0                   0                   0                  NA            0.999901
    dataset3_SRX4143121
404                   0
difference <- calculate_difference(diversity_data_filtered[404:405,], # 404 - problematic row
                                   samples_data,
                                   control = "control",
                                   method = "mean",
                                   test = "shuffle")

# --> Error in if (ecdf(shuffled[i, ])(log2_fc[i]) >= 0.5) { : 
#  missing value where TRUE/FALSE needed

I guess the reason for this is that the number of samples (columns) is low (10) and only 2 of the values are non-zero and non-NA, so some of the shuffled groupings presumably end up with a group that contains nothing but 0s and NAs.
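
This guess can be checked on the printed row without running the package. The sketch below rests on two assumptions that are not confirmed anywhere in this thread: that the 10 samples are split 5 versus 5, and that the statistic is the log2 ratio of group means with NAs ignored. It enumerates every possible split instead of sampling random shuffles.

```r
# Values of row 404 as printed above (10 samples).
row_404 <- c(NA, 0.03355015, NA, NA, 0, 0, 0, NA, 0.999901, 0)

# Assumption: 5 control and 5 case samples; enumerate every possible split.
splits <- combn(length(row_404), 5)

# Assumed statistic: log2 ratio of group means, ignoring NAs.
fold_change <- apply(splits, 2, function(control_idx) {
  control_mean <- mean(row_404[control_idx], na.rm = TRUE)
  case_mean <- mean(row_404[-control_idx], na.rm = TRUE)
  log2(case_mean / control_mean)
})

# Count the splits in which one group mean is 0, giving Inf or -Inf.
n_splits <- length(fold_change)
n_bad <- sum(!is.finite(fold_change))
cat(n_bad, "of", n_splits, "splits give a non-finite fold change\n")
```

Under these assumptions, any split that puts both non-zero values into the same group leaves the other group with a mean of 0, so the fold change for that split is infinite.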
