Open Source Overview
Over the last five years I have worked with, written about, and reviewed over a hundred open source projects, whether in systems built, books written, or via blogs, papers, and presentations. The question that always comes to my mind is how to build systems from all of these varied products, and how to maintain an awareness of them that is wide enough to encompass their ever-changing mass, range, and grouping of versions.
If I want to build an open source big data system I might stick to tried and trusted components such as Hadoop, HDFS, Spark, Oozie, Hive, Sqoop, and so on. But given the ever-changing variety, what else is available now, and what might be available in the near future? How can I determine which versions of products will safely work together?
I am well aware of integration projects like Apache Bigtop and stack-based releases from MapR, Cloudera, and what was Hortonworks. But I am trying to think beyond their offerings into an ever-changing landscape. I can explain the problem by concentrating largely on the apache.org incubating and current projects, which number around 400.
The database table structure below consists of a simple system list and an associated system type, as well as some system type tags to allow for a basic classification strategy.
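As a rough sketch of what those two tables might look like in SQL, with the column names inferred from the queries later in this article and the data types being my own assumptions:

CREATE TABLE "systemtype" (
    "ID"    INTEGER PRIMARY KEY,
    "name"  VARCHAR(100) NOT NULL      -- system type name, e.g. 'Graph Database'
);

CREATE TABLE "system" (
    "ID"     INTEGER PRIMARY KEY,
    "name"   VARCHAR(100) NOT NULL,                    -- project name, e.g. 'JanusGraph'
    "type"   INTEGER REFERENCES "systemtype" ("ID"),   -- primary system type
    "type2"  VARCHAR(50),                              -- optional classification tags
    "type3"  VARCHAR(50),
    "type4"  VARCHAR(50)
);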
By populating the system table with the hundred or so projects that I have worked with so far, I can begin to get a basic open source overview and search the data. Then, using some basic SQL, I can search for systems that allow graph processing.
SELECT
    "system"."name",
    "systemtype"."name",
    "system"."type2",
    "system"."type3",
    "system"."type4"
FROM
    "systemtype",
    "system"
WHERE
    "systemtype"."ID" = "system"."type"
    AND
    (
        LOWER("system"."type2") LIKE 'graph'
        OR LOWER("system"."type3") LIKE 'graph'
        OR LOWER("system"."type4") LIKE 'graph'
    )
This produces the following list of systems from my data. You may not agree with the classification tags or the actual data, which in truth was created in haste, but hopefully you will agree that there is further hidden structure to this data.
For instance, JanusGraph is the successor to Titan, the big data graph product from Aurelius. Also, Apache TinkerPop, with its Gremlin traversal language, is the graph computing framework that underpins both of these products. So the hidden graph in this data actually looks something like this.
So this graph implies a knowledge of the history of each system as well as the dependencies of each system. If I now want to search on the algorithms supported by each product, I can add another couple of tables. The database table structure below shows two new tables connected to the system table: algorithm and algorithm type.
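Again as a sketch, with column names inferred from the query further down and data types assumed, the two new tables might be defined like this:

CREATE TABLE "AlgorithmType" (
    "ID"       INTEGER PRIMARY KEY,
    "AlgType"  VARCHAR(100) NOT NULL   -- algorithm category, e.g. 'Classification'
);

CREATE TABLE "MachineLearningAlgorithm" (
    "ID"             INTEGER PRIMARY KEY,
    "systemid"       INTEGER REFERENCES "system" ("ID"),          -- owning system
    "algorithmtype"  INTEGER REFERENCES "AlgorithmType" ("ID"),   -- category link
    "algorithm"      VARCHAR(100) NOT NULL                        -- e.g. 'Naive Bayes'
);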
This allows systems to be searched using the example SQL below. Imagine that I wanted to find a system that supported the Naive Bayes or K-Means algorithms; then I might search like this.
SELECT
    "system"."name",
    "systemtype"."name",
    "MachineLearningAlgorithm"."algorithm",
    "AlgorithmType"."AlgType"
FROM
    "system",
    "systemtype",
    "MachineLearningAlgorithm",
    "AlgorithmType"
WHERE
    "system"."type" = "systemtype"."ID"
    AND "MachineLearningAlgorithm"."systemid" = "system"."ID"
    AND "AlgorithmType"."ID" = "MachineLearningAlgorithm"."algorithmtype"
    AND
    (
        LOWER("MachineLearningAlgorithm"."algorithm") LIKE '%k-means%'
        OR LOWER("MachineLearningAlgorithm"."algorithm") LIKE '%kmeans%'
        OR LOWER("MachineLearningAlgorithm"."algorithm") LIKE '%bayes%'
    )
This would produce a system algorithm list, something like the one below, drawn from my current list of one hundred systems. Again, don't worry about the actual data; it is just an example provided in haste, and I am only trying to clarify the nature of a larger problem. It is clear that there are many attributes of system-based projects that, if collected in one place in one large database, might prove useful. The next section will examine some of them.
System Attributes
This finally brings me to the sort of data that might be useful about each open source system when trying to choose components to build a stack. The following list of system attributes is suggestive of what could be useful; you can probably think of a few more items, and a rough sketch of how a couple of them might be modelled follows the list.
- History
- Release History
- Version History
- Functionality
- Algorithms Supported
- Plugins
- APIs Available
- Development Languages Supported
- Licensing
- Supporting Organization
- Dependencies (with versions)
- Version Mapping
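As a purely hypothetical sketch, attributes such as release history and licensing could hang off the system table as additional tables; none of these table or column names come from my actual database:

CREATE TABLE "release" (
    "ID"           INTEGER PRIMARY KEY,
    "systemid"     INTEGER REFERENCES "system" ("ID"),
    "version"      VARCHAR(50) NOT NULL,   -- e.g. '3.4.1'
    "releasedate"  DATE,
    "incubating"   BOOLEAN                 -- was the project still incubating at this release?
);

CREATE TABLE "licence" (
    "ID"        INTEGER PRIMARY KEY,
    "systemid"  INTEGER REFERENCES "system" ("ID"),
    "name"      VARCHAR(100)               -- e.g. 'Apache 2.0', 'GPLv3'
);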
As you can see from the graph above, history is useful because we can determine a product's state and possible successors. Release history is also useful because we can determine the latest release version and incubator state (release history would include a version list). A categorized list of functionality would allow us, as shown, to search a system database. Algorithms supported would allow systems to be found which support specific desired algorithms, e.g. Naive Bayes.
Plugins, APIs, and languages supported would be searchable and might supply the functionality needed to meet our requirements. Licensing and the supporting organization might be important, because not all licenses are as friendly to commercial use as the Apache 2.0 license.
Dependencies, along with their versions, form an important chain of the products required to allow a system to function. This is an essential part of the information required when adopting a system stack component. Finally, I have included version mapping in this list. By that I mean the list of system versions that have been integration tested together and are known to work well with each other. Think of the Bigtop project or a MapR or Cloudera system stack version map.
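As a purely hypothetical sketch, with table and column names invented for illustration, a version mapping could be stored as pairs of system versions known to work together and then queried for compatibility:

-- Pairs of system versions that have been integration tested together,
-- for example by Bigtop or as part of a vendor stack release.
CREATE TABLE "versionmap" (
    "systemid1"  INTEGER REFERENCES "system" ("ID"),
    "version1"   VARCHAR(50),
    "systemid2"  INTEGER REFERENCES "system" ("ID"),
    "version2"   VARCHAR(50)
);

-- Which versions of Spark are known to work with which versions of Hive?
SELECT s1."name", v."version1", s2."name", v."version2"
FROM "versionmap" v, "system" s1, "system" s2
WHERE v."systemid1" = s1."ID"
  AND v."systemid2" = s2."ID"
  AND LOWER(s1."name") = 'spark'
  AND LOWER(s2."name") = 'hive';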
All of this information stored in one
large searchable database would be invaluable when choosing potential
system products. Perhaps you could search on system algorithms
supported and then receive a report on a chosen system containing all
of the information above.
Imagine the time and money potentially saved if developers, analysts, architects, and project managers had access to such a vast collection of data. Why do I say that this is a vast data landscape? Surely I have only mentioned 400 Apache projects so far?
Open Source Organisations And Beyond
There are many more open source organizations than just Apache, some of which deal with software and many of which are supporting organizations. As the rough list below shows, there are many open source software organizations providing many products.
In reality an integrated stack may contain open source and non open source components. It might contain cloud based and non cloud based services as well as third party code. Many products and systems provided by different organizations might have relationships or dependencies upon each other. For instance, Apache Spark can run on both Mesos and Kubernetes. As another example, a project might use the open source MapR Hadoop-based stack but then also use the commercial product Qlik (Attunity) Replicate to move data from an RDBMS to HDFS.
There is also a vast array of open source and commercial projects and companies available. Imagine all of the projects supported by open source organizations, and all of the *.io and *.ai web based products (h2o.ai being one example). Add in all of the independent organizations, both open source and commercial. The possibilities and the system related data begin to scale out into a large repository.
It would be very useful to have a single repository of software systems information for all of these systems, be they open source or commercial, containing all of the attributes listed here and more. That way, choosing system stack components in the future would not need to rely on suggestions, the knowledge of friends, or ad hoc Google searches.
It is just a thought that has been bugging me as I widen my search across the open source domain and try to gain a greater understanding of what systems are available and what they can do. Perhaps you would care to comment?