Abstract of PhD Thesis Intelligent Data Processing and Its Applications
Aniko Szilvia Vagner
1 Introduction
Nowadays the rapidly increasing performance
of hardware and efficient, intelligent
scientific algorithms enable us to store and
process big data. This trend opens up more and
more opportunities to extract information from
large amounts of data. My thesis is only
a precursor of this topic, because
I did not have sufficient hardware and I
had only a small amount of data to process.
However, all the topics of my thesis belong
to intelligent data processing.
In Chapter 2 of my thesis I introduce a new
clustering algorithm named
GridOPTICS, whose goal is to accelerate the
well-known OPTICS density-based
clustering technique. Density-based clustering
techniques are capable of
recognizing arbitrarily shaped clusters in a
point set. DBSCAN results in
only one cluster set, but OPTICS
generates a reachability plot from which
many cluster sets can be read
without having to execute the whole
algorithm again. I found that OPTICS is
very slow for large data sets, so I wanted
to find a way to accelerate it. To check
whether GridOPTICS is faster than OPTICS,
I executed both algorithms on several
point sets.
In Chapter 3 of my thesis I introduce two
new modules of the Cardiospy system of
Labtech Ltd. On these two projects I worked
together with Istvan Juhasz, Laszlo
Farkas, Peter Toth, and four students of the
university: Jozsef Kuk, Adam Balazs, Bela Vamosi, and David Angyal. Bela Kincs,
who was the executive of Labtech Ltd., wanted the Cardiospy system to be
improved. He and his team surveyed what the users' demands are in this
area and how their software could be better. Labtech Ltd. and the University
of Debrecen worked together in two projects. In both cases Labtech had early
solutions for the algorithms, but they were slow and gave insufficient
results, or the results could not be validated. Moreover,
there were no visualization tools for
either problem. The tasks of the team of the
University of Debrecen were to provide a fast
algorithm and to create an interactive
visualization interface for each problem.
The goal of the first module of Cardiospy
is to cluster and visualize long (up
to 24 hours) recordings of ECG signals,
because the manual evaluation of long
recordings is a lengthy and tedious task.
During this project I recognized that it
is a very interesting question in its own right
how OPTICS can be accelerated with a
grid clustering method, independently
of ECG signals.
The goal of the second module of Cardiospy
is to calculate and visualize the
steps of the blood pressure measurement and
the values of blood pressure. The
recordings (which can contain a sequence of
measurements) are collected by a
microcontroller, but this module runs on a
PC. With the help of the application,
physicians can recognize the types of
errors in the measurements and
can also find the noisy measurements.
In Chapter 4 I introduce how I applied an
active learning method in a subject
whose topic is database programming. I
taught Oracle SQL and PL/SQL in
the Advanced DBMS 1 subject, and I saw that
the students did not practice at
home. The prerequisites of this subject
are the Programming language and
the Database systems courses, so the students
are not absolute beginners in the field. I
wanted to force the students to try out the
programming tools independently, but
with the help of the teacher.
To support the active learning method, an
application had to be built. The
application helps the teacher organize and
monitor the tasks and the students'
solutions. Moreover, the application
can verify the syntax of a solution
before the student uploads it. If the
syntax is wrong, the student cannot
upload it. This feature makes the task of
the teacher easier.
To assess whether the active learning
method is effective, I gathered and
examined the results of the students during
the 3 years when I used this method.
2 New results
The abstract of the thesis presents new
results grouped into four main statements.
The first statement deals with a clustering
method, and the second one demonstrates
an application of the GridOPTICS clustering
method, namely the clustering of ECG signals.
The third statement introduces the
visualization of the steps of the blood pressure
measurement, whereas the last statement
demonstrates how the students' solutions
can easily be managed during an
active learning method for database
programming.
2.1 A clustering algorithm
Cluster analysis is an important research field
of data mining, which is applied
in many other disciplines, such as pattern
recognition, image processing, machine
learning, bioinformatics, information
retrieval, artificial intelligence, marketing,
and psychology. Density-based
clustering approaches are capable of finding
arbitrarily shaped clusters, but they have
a disadvantage: it is hard to
choose parameter values so that the
algorithm gives an appropriate result
(Gan et al., 2007). The OPTICS (Ankerst et
al., 1999) clustering algorithm gives
not just one result but a set of
results. It builds a reachability plot, that is, it
orders the input points and assigns a
reachability distance to each input point.
Based on the reachability plot, the
algorithm can produce many clustering
results. Building the reachability plot is
slow, but reading the clusters from the
reachability plot is fast.
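Reading one clustering from a finished reachability plot can be sketched as follows. This is a minimal Python sketch in the style of the threshold-based extraction described together with OPTICS (Ankerst et al., 1999), not the code of the thesis. The function name extract_clusters, the NOISE label, and the sample values are assumptions made for illustration. It assumes that every point in the cluster order has a reachability distance and a core distance, with undefined values stored as infinity.

```python
import math

NOISE = -1  # hypothetical label for points that belong to no cluster


def extract_clusters(reachability, core_distance, threshold):
    """Assign a cluster id to each point of an OPTICS cluster order.

    reachability[i] and core_distance[i] belong to the i-th point of the
    ordering; math.inf stands for an undefined distance.
    """
    labels = []
    current = NOISE
    for reach, core in zip(reachability, core_distance):
        if reach > threshold:
            # The point is not density-reachable from its predecessors.
            if core <= threshold:
                current += 1          # it is a core point: start a new cluster
                labels.append(current)
            else:
                labels.append(NOISE)  # neither reachable nor core: noise
        else:
            labels.append(current)    # reachable: stays in the current cluster
    return labels


# Made-up reachability plot with two valleys and one trailing outlier.
reach = [math.inf, 0.5, 0.6, 3.0, 0.4, 0.5, 5.0]
core = [0.5, 0.5, 0.6, 0.4, 0.4, 0.5, math.inf]

fine = extract_clusters(reach, core, threshold=1.0)
coarse = extract_clusters(reach, core, threshold=4.0)
```

The two calls reuse the same plot with different thresholds, which illustrates why several clusterings can be read without rebuilding the ordering.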
OPTICS has a limitation: it has
high complexity, which means that
it is very slow for large data sets (Yue et
al., 2007) (Schneider and Vlachos, 2013).
Statement A - The GridOPTICS clustering
algorithm: I introduced a
new clustering algorithm named GridOPTICS,
which is a combination of a grid
clustering technique and the OPTICS
algorithm. For large input point sets the
GridOPTICS algorithm works with insignificant
information loss and runs
one or more orders of magnitude faster
than the OPTICS algorithm. (Vagner,
in press)
The main idea of the GridOPTICS algorithm
is to reduce the number of input
points with a grid technique and then to
execute the OPTICS algorithm on the
grid structure. Based on the reachability
plot, the clusters of the grid structure
can be determined. In the end, the input
points can be assigned to the clusters.
The experimental results show that
GridOPTICS can be faster than OPTICS by
orders of magnitude, which is
very useful for large data sets.
However, they also show that the GridOPTICS
algorithm is less accurate than
OPTICS.
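The grid reduction and the final assignment step described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the GridOPTICS implementation itself. It assumes two-dimensional points and a uniform grid with a hypothetical cell_size parameter, and the clustering of the grid cells is replaced by a hand-written placeholder dictionary where an OPTICS run on the grid structure would go.

```python
import math
from collections import defaultdict


def grid_reduce(points, cell_size):
    """Map every point to the grid cell that contains it.

    Returns a dict: cell coordinates -> list of indexes of the points in
    that cell. Only non-empty cells are stored, so the clustering step
    works on far fewer objects than the original point set.
    """
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(idx)
    return dict(cells)


def assign_points(cells, cell_labels, n_points, noise=-1):
    """Give each input point the cluster label of its grid cell."""
    labels = [noise] * n_points
    for key, members in cells.items():
        for idx in members:
            labels[idx] = cell_labels.get(key, noise)
    return labels


points = [(0.1, 0.2), (0.3, 0.4), (0.9, 0.8), (5.1, 5.2), (5.4, 5.3)]
cells = grid_reduce(points, cell_size=1.0)

# Placeholder (assumption): labels that a clustering of the grid cells
# would produce; in GridOPTICS this comes from OPTICS on the grid.
cell_labels = {(0, 0): 0, (5, 5): 1}

labels = assign_points(cells, cell_labels, len(points))
```

Five points collapse into two occupied cells here; the coarser the grid, the fewer cells remain and the more detail is lost, which matches the speed and accuracy trade-off reported above.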
2.2 Cardiology information system for ECG
signals
The big data problem also appears in the
medical area. Without intelligent
information systems, the physicians cannot
e[...] One of the modules of the Cardiospy software is the ECG clustering module.
Statement B - Clustering and visualization
of ECG signals: We
developed the ECG clustering and
visualization module of the Cardiospy software. The
goal of the module is to cluster and
visualize long (up to 24 hours) recordings
of ECG signals. In this way the
cardiologists can more easily find the heart beats which
morphologically differ from the normal
beats. (Vagner et al., 2011 A)
On this project I worked together with
Laszlo Farkas (Labtech Ltd.), Istvan
Juhasz (Faculty of Informatics, University
of Debrecen), and two students from
the Faculty of Informatics, University of
Debrecen, Jozsef Kuk and Adam Balazs.
My contribution to this project was to
implement the clustering algorithm and
make it fast. The clustering algorithm is a
special simpler version of the
GridOPTICS algorithm. I also contributed to
2.3 Cardiology information system for blood pressure measurement
In public health care it is very common
that a microcontroller calculates the
result of oscillometric blood pressure
measurements. It has only limited resources,
such as memory and processor, and it
can give only a little feedback about
the measurement. This means that the result
can be imprecise, and it does not inform
the patient and the physician
appropriately. (Sorvoja, 2006)
The Cardiospy software has another module, the
blood pressure measurement module.
It receives the recordings collected by the
microcontroller. A recording can
contain a single measurement or a sequence of
measurements taken over 24
hours. Cardiospy runs on a PC, so
the algorithm can use more
resources (memory and processor), which
means that it is faster and more precise.
Additionally, it can visualize the whole
process of the measurement.
Statement C - Visualization of off-line
processing of blood pressure
measurements: We developed the blood
pressure measurement module of
Cardiospy software. The goal of the blood
pressure measurement module is to
calculate and visualize the values of blood
pressure. (Vagner et al., 2014)
The module determines the values of the
blood pressure based on an oscillometric
blood pressure measurement algorithm. The
application visualizes the result of
each step of the algorithm. The algorithm
decides whether the result is acceptable
and authentic based on the characteristics
of the measurement.
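The abstract does not state which oscillometric algorithm the module uses. Purely as an illustration of the kind of computation involved, the sketch below uses the commonly described fixed-ratio maximum amplitude approach, swapped in here as an assumption: the mean pressure is read at the largest oscillation, and the systolic and diastolic values are read where the oscillation amplitude is a fixed fraction of that maximum. The function name, the ratios 0.5 and 0.8, and the sample data are placeholders, not values from the thesis.

```python
def max_amplitude_estimate(cuff_pressure, amplitude,
                           sys_ratio=0.5, dia_ratio=0.8):
    """Estimate (systolic, mean, diastolic) from a deflation recording.

    cuff_pressure: cuff pressure at each detected pulse, decreasing (mmHg)
    amplitude:     oscillation amplitude of the same pulses
    sys_ratio, dia_ratio: illustrative fractions of the maximum amplitude
    """
    indexes = range(len(amplitude))
    peak = max(indexes, key=lambda i: amplitude[i])
    a_max = amplitude[peak]

    # Systolic: sample on the high-pressure side of the peak whose
    # amplitude is closest to sys_ratio * a_max.
    sys_idx = min(range(0, peak + 1),
                  key=lambda i: abs(amplitude[i] - sys_ratio * a_max))

    # Diastolic: sample on the low-pressure side of the peak whose
    # amplitude is closest to dia_ratio * a_max.
    dia_idx = min(range(peak, len(amplitude)),
                  key=lambda i: abs(amplitude[i] - dia_ratio * a_max))

    return cuff_pressure[sys_idx], cuff_pressure[peak], cuff_pressure[dia_idx]


# Made-up recording: pressure falls while the oscillations rise and fall.
cuff = [160, 150, 140, 130, 120, 110, 100, 90, 80, 70, 60]
amp = [0.2, 0.4, 0.8, 1.0, 1.4, 1.8, 2.0, 1.6, 1.2, 0.6, 0.3]

systolic, mean_pressure, diastolic = max_amplitude_estimate(cuff, amp)
```

Each intermediate quantity (the amplitude envelope, its peak, the two ratio points) is something a visualization of the measurement steps could display.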
The other part of the application helps in
the validation process. It executes
the blood pressure measurement algorithm on
a large set of measurements, each of
which has reference blood pressure values.
The application shows the differences
between the results of the algorithm and
the reference values, and it helps to
qualify the algorithm according to the
international standards.
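The comparison against reference values can be sketched as follows. This is a minimal Python sketch, assuming that the validation summarizes the differences by their mean and sample standard deviation. The abstract does not name the standard or its limits, so the thresholds max_mean and max_sd are hypothetical parameters with illustrative defaults, and the data are made up.

```python
from statistics import mean, stdev


def validate(measured, reference, max_mean=5.0, max_sd=8.0):
    """Compare algorithm results with reference values (all in mmHg).

    Returns (mean difference, standard deviation of differences, passed).
    max_mean and max_sd are illustrative limits, not taken from the thesis.
    """
    diffs = [m - r for m, r in zip(measured, reference)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)  # sample standard deviation, needs >= 2 values
    passed = abs(mean_diff) <= max_mean and sd_diff <= max_sd
    return mean_diff, sd_diff, passed


# Made-up systolic values from the algorithm and from the reference method.
measured = [120, 118, 125, 130]
reference = [122, 117, 121, 131]

mean_diff, sd_diff, passed = validate(measured, reference)
```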
On this project I worked together with
Peter Toth (Labtech Ltd.), Istvan Juhasz
(Faculty of Informatics, University of
Debrecen), and two students from the
Faculty of Informatics, University of
Debrecen, Bela Vamosi and David Angyal.
My contribution to this project was to
construct and implement a signal processing
algorithm which produces the blood pressure
values and the pulse values of a
measurement.
2.4 Education of database programming
finding out how we can characterize the m