Positive pixel count and parallel processing with Dask¶
Overview¶
Positive pixel count is a routine that classifies pixels by their position in HSI Color Space and computes statistics based on this classification.
HistomicsTK has two main functions for positive pixel counting, count_slide
and count_image
, both of which live in the histomicstk.segmentation.positive_pixel_count
module (imported in this notebook as simply ppc
). count_image
operates on an in-memory image in a format compatible with NumPy’s ndarray
type, and returns both the classification statistics and a label image that may be useful for visualization or further analysis.
count_slide
accepts a path to an image instead, and while it can also carry out exactly the same computation as count_image
, its advantage lies in its ability to distribute its computation using Dask and to operate on images too large to fit in memory. This ability comes at a cost – to enable it, the generation of the label image must be disabled. (For the curious, the necessary underlying support for writing large images a tile at a time
is lacking.) The HistomicsTK CLI PositivePixelCount
is a wrapper around this function.
The rest of this example is subdivided, in order, into count_image
and count_slide
examples, followed by a CLI example.
[1]:
# Configuration and imports of other libraries
from __future__ import print_function
import large_image
%matplotlib inline
import matplotlib.pyplot as plt
import skimage.io
#Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 15, 15
plt.rcParams['image.cmap'] = 'gray'
# Import and alias positive_pixel_count
import histomicstk.segmentation.positive_pixel_count as ppc
Analyzing a small image with count_image
¶
We start with count_image
. The image we will look at is small and can be analyzed quickly.
[2]:
image_url = ('https://data.kitware.com/api/v1/file/'
'598b71ee8d777f7d33e9c1d4/download') # DAB.png
im_input = skimage.io.imread(image_url)
print('Input image')
plt.imshow(im_input)
plt.show()
Input image
With a helper function, it becomes reasonable to use count_image
interactively to explore parameter values. A more advanced analysis might also look at the HSI values directly.
Here, we show briefly how one can use the helper function count_and_label
defined below.
[3]:
def count_and_label(params):
"Compute the label image with count_image, and then display it"
label_image = ppc.count_image(im_input, params)[1]
plt.imshow(label_image)
plt.show()
To start out, we’ll pick a set of parameters. The HSI color space used by the routines is defined to use values in the range [0,1].
[4]:
import ipywidgets as widgets
from ipywidgets import interactive
params= {
'hue_value': widgets.FloatSlider(min=0.0,max=1.0,step=0.004,style = {'description_width': 'initial'}, value=0.05),
'hue_width':widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.15),
'saturation_minimum': widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.05),
'intensity_upper_limit': widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.95),
'intensity_weak_threshold': widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.65),
'intensity_strong_threshold':widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.35),
'intensity_lower_limit':widgets.FloatSlider(min=0.0,max=1.0,step=0.05,style = {'description_width': 'initial'}, value=0.05),
}
def g(**params):
count_and_label(ppc.Parameters(**params))
w = interactive(g,**params)
display(w)
template_params = ppc.Parameters(**w.kwargs)
Internally, the label values are conveniently increasing, regularly spaced integers. Black is therefore background / negative, dark gray is weak positive, light gray is positive, and white is strong positive. To use the label values programmatically, use of the static attributes of ppc.Labels
– .NEGATIVE
, .WEAK
, .PLAIN
, and .STRONG
– is recommended.
This does a reasonably good job aleady, but let’s see what happens if we shift the hue center a bit. Since Parameters
is a collections.namedtuple
, we can use its ._replace
method to substitute a value.
[5]:
count_and_label(template_params._replace(hue_value=0.0))
Here, we see that a number of pixels previously considered positive are now considered negative, creating holes in visible nuclei. This indicates that we’ve moved the hue range too far.
In any case, we can also view the statistics. Here we use the original parameter values.
[6]:
stats, label_image = ppc.count_image(im_input, template_params)
def pp_namedtuple(t):
"Pretty-print a namedtuple by printing each field on its own line and left-aligning all values"
print(type(t).__name__)
maxlen = max(map(len, t._fields))
for f in t._fields:
print(f, getattr(t, f), sep=':' + ' ' * (maxlen - len(f)) + '\t')
pp_namedtuple(stats)
Output
NumberWeakPositive: 62163
NumberPositive: 24459
NumberStrongPositive: 4948
NumberTotalPixels: 1440000
IntensitySumWeakPositive: 49554.65359477124
IntensitySumPositive: 12907.145098039215
IntensitySumStrongPositive: 1423.4183006535948
IntensityAverage: 0.6976653597626302
RatioStrongToTotal: 0.054035164355138145
IntensityAverageWeakAndPositive: 0.7210846977997558
RatioStrongToPixels: 0.003436111111111111
RatioWeakToPixels: 0.04316875
RatioTotalToPixels: 0.06359027777777777
Analyzing a large image with count_slide
¶
count_slide
has several more parameters than count_image
. Besides make_label_image
, which controls the creation of the label image as mentioned, there is also region
, which instructs count_slide
to operate on only the specified region of the image.
We will visualize its operation on a small region, and then run it on a larger region.
Additional setup¶
First, however, we’ll need to download an image and set up Dask. (The image is 1.630 GiB, so the download takes a while.)
[7]:
# Comment this out (or just don't run it) once you have the file
!curl -OJ 'https://data.kitware.com/api/v1/file/598b5ee88d777f7d33e9c1d1/download'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1668M 100 1668M 0 0 1955k 0 0:14:33 0:14:33 --:--:-- 1665k0 0:14:48 0:00:20 0:14:28 1975k 0 0:14:34 0:03:30 0:11:04 2128k 0 0:14:04 0:06:59 0:07:05 2154k18 0:00:14 1926k
curl: Saved to filename 'TCGA-DX-A6BG-01Z-00-DX2.34763958-0613-4069-9ACC-13D6633FE415.svs'
[8]:
# Set up a basic configuration. Change as needed.
import dask.distributed
dask.distributed.Client()
[8]:
Client
Client-80fcf7a4-e10a-11ec-ac62-3529764f914a
Connection method: Cluster object | Cluster type: distributed.LocalCluster |
Dashboard: http://127.0.0.1:8787/status |
Cluster Info
LocalCluster
dac1cd10
Dashboard: http://127.0.0.1:8787/status | Workers: 4 |
Total threads: 16 | Total memory: 62.54 GiB |
Status: running | Using processes: True |
Scheduler Info
Scheduler
Scheduler-a8376504-33ed-495c-ab64-c0f97dceadb6
Comm: tcp://127.0.0.1:39261 | Workers: 4 |
Dashboard: http://127.0.0.1:8787/status | Total threads: 16 |
Started: Just now | Total memory: 62.54 GiB |
Workers
Worker: 0
Comm: tcp://127.0.0.1:39463 | Total threads: 4 |
Dashboard: http://127.0.0.1:37567/status | Memory: 15.63 GiB |
Nanny: tcp://127.0.0.1:34409 | |
Local directory: /home/local/KHQ/mary.salvi/projects/HistomicsTK/docs/examples/dask-worker-space/worker-gx6r831t |
Worker: 1
Comm: tcp://127.0.0.1:45111 | Total threads: 4 |
Dashboard: http://127.0.0.1:33357/status | Memory: 15.63 GiB |
Nanny: tcp://127.0.0.1:39303 | |
Local directory: /home/local/KHQ/mary.salvi/projects/HistomicsTK/docs/examples/dask-worker-space/worker-rz2sm3be |
Worker: 2
Comm: tcp://127.0.0.1:46301 | Total threads: 4 |
Dashboard: http://127.0.0.1:36999/status | Memory: 15.63 GiB |
Nanny: tcp://127.0.0.1:44531 | |
Local directory: /home/local/KHQ/mary.salvi/projects/HistomicsTK/docs/examples/dask-worker-space/worker-g_4eo0ci |
Worker: 3
Comm: tcp://127.0.0.1:42875 | Total threads: 4 |
Dashboard: http://127.0.0.1:38343/status | Memory: 15.63 GiB |
Nanny: tcp://127.0.0.1:44007 | |
Local directory: /home/local/KHQ/mary.salvi/projects/HistomicsTK/docs/examples/dask-worker-space/worker-sr7795gj |
count_slide
on a small region¶
Now we can inspect the image. We’ll first run the calculation to produce a label image, and then run it again without, allowing parallelization.
[9]:
slide_path = 'TCGA-DX-A6BG-01Z-00-DX2.34763958-0613-4069-9ACC-13D6633FE415.svs'
region = dict(
left=50000, top=35000,
width=1600, height=900,
)
ts = large_image.getTileSource(slide_path)
im_region = ts.getRegion(region=region, format=large_image.tilesource.TILE_FORMAT_NUMPY)[0]
print('The region')
plt.imshow(im_region)
plt.show()
The region
We’ll reuse the parameters from before.
[10]:
stats, label_image = ppc.count_slide(slide_path, template_params, region=region, make_label_image=True)
pp_namedtuple(stats)
plt.imshow(label_image)
plt.show()
Output
NumberWeakPositive: 31591
NumberPositive: 26850
NumberStrongPositive: 5582
NumberTotalPixels: 1440000
IntensitySumWeakPositive: 24633.738562091505
IntensitySumPositive: 13946.745098039215
IntensitySumStrongPositive: 1549.1424836601307
IntensityAverage: 0.626800152192038
RatioStrongToTotal: 0.08718741702200772
IntensityAverageWeakAndPositive: 0.6601612508364114
RatioStrongToPixels: 0.003876388888888889
RatioWeakToPixels: 0.021938194444444444
RatioTotalToPixels: 0.04446041666666667
The output is about the same quality as before. As the small region used in the previous section is in fact extracted from another part of this slide, this is not too surprising.
We’ll now rerun for only the stats, which will make use of Dask. make_label_image
defaults to False
, so we simply omit it here. For the purposes of illustration, we force tile_grouping
to 1 to process each tile in its own task. A larger value, 256, is used as the default value to reduce the overhead associated with Dask tasks.
[11]:
# Note that we still return a tuple, though it now has length 1.
stats_dask, = ppc.count_slide(slide_path, template_params, region=region, tile_grouping=1)
pp_namedtuple(stats_dask)
Output
NumberWeakPositive: 31591
NumberPositive: 26850
NumberStrongPositive: 5582
NumberTotalPixels: 1440000
IntensitySumWeakPositive: 24633.73856209151
IntensitySumPositive: 13946.745098039215
IntensitySumStrongPositive: 1549.142483660131
IntensityAverage: 0.6268001521920381
RatioStrongToTotal: 0.08718741702200772
IntensityAverageWeakAndPositive: 0.6601612508364115
RatioStrongToPixels: 0.003876388888888889
RatioWeakToPixels: 0.021938194444444444
RatioTotalToPixels: 0.04446041666666667
The results are identical up to a tiny amount of floating point error, as can be seen below:
[12]:
print('stats_dask - stats:')
pp_namedtuple(ppc.Output(**{f: getattr(stats_dask, f) - getattr(stats, f) for f in ppc.Output._fields}))
stats_dask - stats:
Output
NumberWeakPositive: 0
NumberPositive: 0
NumberStrongPositive: 0
NumberTotalPixels: 0
IntensitySumWeakPositive: 3.637978807091713e-12
IntensitySumPositive: 0.0
IntensitySumStrongPositive: 2.2737367544323206e-13
IntensityAverage: 1.1102230246251565e-16
RatioStrongToTotal: 0.0
IntensityAverageWeakAndPositive: 1.1102230246251565e-16
RatioStrongToPixels: 0.0
RatioWeakToPixels: 0.0
RatioTotalToPixels: 0.0
count_slide
on a large region¶
As a better exhibition of count_slide
’s use of parallelism in computing statistics, we run it here on a much larger region, 30Kx30K pixels.
(By leaving out the region
parameter, count_slide
will look at the entire image. At 146Kx79K pixels, this image will cause count_slide
to use a lot of memory, potentially enough to require swap space, without additional configuration of large_image
’s caching behavior.)
[13]:
large_region = dict(
left=60e3, top=30e3,
width=30e3, height=30e3,
)
stats, = %time ppc.count_slide(slide_path, template_params, large_region)
pp_namedtuple(stats)
CPU times: user 992 ms, sys: 87 ms, total: 1.08 s
Wall time: 16.1 s
Output
NumberWeakPositive: 8017246
NumberPositive: 4348294
NumberStrongPositive: 599706
NumberTotalPixels: 900000000
IntensitySumWeakPositive: 6295544.718954247
IntensitySumPositive: 2316964.7856209157
IntensitySumStrongPositive: 174806.2339869282
IntensityAverage: 0.6777592757254349
RatioStrongToTotal: 0.04625488787486177
IntensityAverageWeakAndPositive: 0.6964927940530833
RatioStrongToPixels: 0.00066634
RatioWeakToPixels: 0.008908051111111112
RatioTotalToPixels: 0.014405828888888889
[14]:
# pp_namedtuple(template_params)
# print('\nRegion:', region)
[15]:
# %%script sh
# # Stats and label image output must be specified via file arguments to
# # --returnparameterfile and --outputLabelImage, respectively.
# python ../../server/PositivePixelCount/PositivePixelCount.py \
# 'TCGA-DX-A6BG-01Z-00-DX2.34763958-0613-4069-9ACC-13D6633FE415.svs' \
# 0.05 0.15 0.05 0.95 0.65 0.35 0.05 --region 50000,35000,1600,900 \
# --returnparameterfile stats.txt --outputLabelImage labelImage.png 2>/dev/null
[16]:
# print("stats.txt:")
# for l in open('stats.txt'):
# print(l.rstrip())
# print("\nlabelImage.png:")
# plt.imshow(skimage.io.imread('labelImage.png'))
# plt.show()