Open Source Overview

I have worked with, written about and reviewed over a hundred open source projects over the last five years, whether in systems built, books written, or blogs, papers and presentations. The question that always comes to mind is how to build systems from all of these varied products, and how to maintain an awareness of them that is wide enough to encompass their ever-changing number, range and versions.

If I want to build an open source big data system I might stick to tried and trusted components such as Hadoop, HDFS, Spark, Oozie, Hive and Sqoop. But with an ever-changing variety, what else is available now, and what might be available in the near future? How can I determine which versions of products will safely work together?

I am well aware of integration projects like Apache Bigtop and stack-based releases from MapR, Cloudera and what was Hortonworks. But I am trying to think beyond their offerings into an ever-changing landscape. I can explain the problem by concentrating largely on the incubating and current projects at apache.org, which number around 400.

The database table structure below comprises a simple system list and an associated system type, as well as some system type tags to allow for a basic classification strategy.


By populating the system table with the hundred or so projects that I have worked with so far, I can begin to get a basic open source overview and search the data. Then, using some basic SQL, I can search for systems that allow graph processing.

SELECT
    "system"."name", "systemtype"."name",
    "system"."type2", "system"."type3",
    "system"."type4"
FROM
    "systemtype", "system"
WHERE
    "systemtype"."ID" = "system"."type" AND
    (
        LOWER("system"."type2") LIKE 'graph' OR
        LOWER("system"."type3") LIKE 'graph' OR
        LOWER("system"."type4") LIKE 'graph'
    )
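
The query above can be tried out with a minimal sketch like the one below, which uses Python's built-in sqlite3 module. The table and column names come from the query itself, but the column types, the system type names and the sample rows are my assumptions for illustration only; they are not the real data.

```python
import sqlite3

# Column names are taken from the query in the text. Column types and all
# sample rows are assumptions made only so that the query can be run.
SCHEMA = """
CREATE TABLE systemtype (
    ID   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE system (
    ID    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    type  INTEGER REFERENCES systemtype(ID),  -- main system type
    type2 TEXT,                               -- optional classification tags
    type3 TEXT,
    type4 TEXT
);
"""

TAG_QUERY = """
SELECT "system"."name", "systemtype"."name",
       "system"."type2", "system"."type3", "system"."type4"
FROM "systemtype", "system"
WHERE "systemtype"."ID" = "system"."type" AND
      ( LOWER("system"."type2") LIKE :tag OR
        LOWER("system"."type3") LIKE :tag OR
        LOWER("system"."type4") LIKE :tag )
ORDER BY "system"."name"
"""


def build_demo_db():
    """Create an in-memory database holding a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO systemtype VALUES (?, ?)",
                     [(1, "Database"), (2, "Processing")])
    conn.executemany(
        "INSERT INTO system (name, type, type2, type3, type4) "
        "VALUES (?, ?, ?, ?, ?)",
        [("JanusGraph", 1, "Graph", "Big Data", None),
         ("Titan", 1, "Graph", None, None),
         ("Hive", 1, "SQL", "Big Data", None),
         ("Spark", 2, "Big Data", "Streaming", "Graph")])
    return conn


def find_by_tag(conn, tag):
    """Return systems where any of the three tag columns matches `tag`."""
    return conn.execute(TAG_QUERY, {"tag": tag.lower()}).fetchall()


conn = build_demo_db()
for row in find_by_tag(conn, "graph"):
    print(row)
```

Binding the tag as a parameter rather than pasting it into the SQL string means the same query can be reused for any tag.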

This produces the following list of systems from my data. You may not agree with the classification tags or the actual data, which in truth has been created in haste. But hopefully you will agree that there is further hidden structure to this data.


For instance, JanusGraph is the successor to Titan, the big data graph product from Aurelius. Also, TinkerPop is actually a Gremlin-backed graph framework which powers both of these products. So the hidden graph in this data actually looks something like this.
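
One way to hold this hidden graph as data is a simple list of labelled edges. The sketch below is only that, a sketch: the three edges come from the paragraph above, while the relation labels "succeeds" and "built_on" and the function names are hypothetical choices of mine.

```python
# Edges taken from the text: JanusGraph succeeds Titan, and both products
# are powered by TinkerPop. The relation labels are hypothetical.
EDGES = [
    ("JanusGraph", "succeeds", "Titan"),
    ("JanusGraph", "built_on", "TinkerPop"),
    ("Titan", "built_on", "TinkerPop"),
]


def successors_of(edges, system):
    """Systems that directly succeed `system`."""
    return [src for src, rel, dst in edges
            if rel == "succeeds" and dst == system]


def built_on(edges, system):
    """Frameworks that `system` is built on."""
    return [dst for src, rel, dst in edges
            if rel == "built_on" and src == system]


def dependents_of(edges, framework):
    """Systems that are built on `framework`."""
    return sorted(src for src, rel, dst in edges
                  if rel == "built_on" and dst == framework)


print(successors_of(EDGES, "Titan"))
print(dependents_of(EDGES, "TinkerPop"))
```

The same edges could just as easily live in a relational table with source, relation and target columns, or in a graph database.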


This graph implies a knowledge of the history of each system as well as its dependencies. If I now want to search on the algorithms supported by each product, I can add another couple of tables. The database table structure below shows two new tables connected to the system table: algorithm and algorithm type.


This allows systems to be searched using the example SQL below. Imagine that I wanted to find a system that supported the Naive Bayes or K-Means algorithm; I might search like this.

SELECT
    "system"."name",
    "systemtype"."name",
    "MachineLearningAlgorithm"."algorithm",
    "AlgorithmType"."AlgType"

FROM
    "system",
    "systemtype",
    "MachineLearningAlgorithm",
    "AlgorithmType"

WHERE
    "system"."type" = "systemtype"."ID" AND
    "MachineLearningAlgorithm"."systemid" = "system"."ID" AND
    "AlgorithmType"."ID" = "MachineLearningAlgorithm"."algorithmtype" AND

    (
        lower("MachineLearningAlgorithm"."algorithm") LIKE '%k-means%' OR
        lower("MachineLearningAlgorithm"."algorithm") LIKE '%kmeans%' OR
        lower("MachineLearningAlgorithm"."algorithm") LIKE '%bayes%'
    )
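
The algorithm search above can be generalized to any list of keywords. The following is a rough sketch, again using Python's sqlite3 module. Table and column names follow the query in the text, but the column types, the find_by_algorithm helper and every sample row are assumptions for illustration.

```python
import sqlite3

# Table and column names follow the query in the text; types are assumed.
SCHEMA = """
CREATE TABLE systemtype (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE system (ID INTEGER PRIMARY KEY, name TEXT, type INTEGER);
CREATE TABLE AlgorithmType (ID INTEGER PRIMARY KEY, AlgType TEXT);
CREATE TABLE MachineLearningAlgorithm (
    ID            INTEGER PRIMARY KEY,
    systemid      INTEGER REFERENCES system(ID),
    algorithm     TEXT,
    algorithmtype INTEGER REFERENCES AlgorithmType(ID)
);
"""


def find_by_algorithm(conn, keywords):
    """Return rows whose algorithm name contains any of the keywords."""
    if not keywords:
        return []
    # One LIKE clause per keyword; the values are bound as parameters.
    clauses = " OR ".join(
        'lower("MachineLearningAlgorithm"."algorithm") LIKE ?'
        for _ in keywords)
    sql = f"""
        SELECT "system"."name", "systemtype"."name",
               "MachineLearningAlgorithm"."algorithm",
               "AlgorithmType"."AlgType"
        FROM "system", "systemtype",
             "MachineLearningAlgorithm", "AlgorithmType"
        WHERE "system"."type" = "systemtype"."ID" AND
              "MachineLearningAlgorithm"."systemid" = "system"."ID" AND
              "AlgorithmType"."ID" =
                  "MachineLearningAlgorithm"."algorithmtype" AND
              ( {clauses} )
        ORDER BY "system"."name", "MachineLearningAlgorithm"."algorithm"
    """
    params = [f"%{keyword.lower()}%" for keyword in keywords]
    return conn.execute(sql, params).fetchall()


# Illustrative rows only; they are not the data behind the article.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO systemtype VALUES (?, ?)",
                 [(1, "Processing"), (2, "Machine Learning")])
conn.executemany("INSERT INTO system VALUES (?, ?, ?)",
                 [(1, "Spark", 1), (2, "Mahout", 2)])
conn.executemany("INSERT INTO AlgorithmType VALUES (?, ?)",
                 [(1, "Clustering"), (2, "Classification"), (3, "Regression")])
conn.executemany("INSERT INTO MachineLearningAlgorithm VALUES (?, ?, ?, ?)",
                 [(1, 1, "K-Means", 1), (2, 1, "Naive Bayes", 2),
                  (3, 1, "Linear Regression", 3), (4, 2, "KMeans", 1)])

for row in find_by_algorithm(conn, ["k-means", "kmeans", "bayes"]):
    print(row)
```

Searching on both 'k-means' and 'kmeans' mirrors the original query and shows why a controlled vocabulary for algorithm names would make such a database easier to search.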

This would produce a system algorithm list, drawn from my current list of one hundred systems, something like this.


Again, don't worry about the actual data; it is just an example provided in haste. I am just trying to clarify the nature of a larger problem. Clearly there are many attributes of these projects that, if collected in one place in one large database, might prove useful. The next section will examine some of them.

System Attributes

This finally brings me to the sort of data that might be useful about each open source system when trying to choose components to build a stack. The following list of system attributes suggests what could be useful. You can probably think of a few more items.
  • History
  • Release History
  • Version History
  • Functionality
  • Algorithms Supported
  • Plugins
  • APIs Available
  • Development Languages Supported
  • Licensing
  • Supporting Organization
  • Dependencies ( with version )
  • Version Mapping
As you could see from the graph above, history is useful because it lets us determine a product's state and possible successors. Release history is also useful because it gives us the latest release version and incubator state (release history would include a version list). A categorized list of functionality would allow us, as shown, to search a system database. Algorithms supported would allow us to find systems that support specific desired algorithms, e.g. Naive Bayes.

Plugins, APIs and languages supported would be searchable and might supply the desired functionality to meet our needs. Licensing and supporting organization might be important because not all licenses are as supportive of commerce as the Apache 2 license.

Dependencies, along with their versions, form an important chain of the products required to allow a system to function. This is an essential part of the information required when adopting a system stack component. Finally, I have included version mapping in this list. By that I mean the list of system versions that have been integration tested together and are known to work well with each other. Think of the Bigtop project or a MapR or Cloudera system stack version map.

All of this information stored in one large searchable database would be invaluable when choosing potential system products. Perhaps you could search on system algorithms supported and then receive a report on a chosen system containing all of the information above.
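
As a rough illustration of such a report, the attribute list above could be modelled as a single record per system. The field names below are my own rendering of that list, and the example values are limited to what this article says about JanusGraph; everything else is left empty rather than guessed.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SystemRecord:
    """One record per system; field names mirror the attribute list."""
    name: str
    history: str = ""
    release_history: list = field(default_factory=list)
    version_history: list = field(default_factory=list)
    functionality: list = field(default_factory=list)
    algorithms: list = field(default_factory=list)
    plugins: list = field(default_factory=list)
    apis: list = field(default_factory=list)
    languages: list = field(default_factory=list)
    licensing: str = ""
    supporting_organization: str = ""
    dependencies: dict = field(default_factory=dict)  # name -> version or None
    version_mapping: list = field(default_factory=list)


def report(record):
    """Render the non-empty attributes of a record as plain text lines."""
    lines = []
    for spec in fields(record):
        value = getattr(record, spec.name)
        if not value:
            continue  # skip attributes that have not been collected yet
        label = spec.name.replace("_", " ").title()
        if isinstance(value, dict):
            value = ", ".join(
                f"{dep} ({version or 'version not recorded'})"
                for dep, version in value.items())
        elif isinstance(value, list):
            value = ", ".join(value)
        lines.append(f"{label}: {value}")
    return "\n".join(lines)


janusgraph = SystemRecord(
    name="JanusGraph",
    history="Successor to Titan",
    functionality=["Graph"],
    dependencies={"TinkerPop": None},
)
print(report(janusgraph))
```

Leaving unknown attributes empty keeps the report honest about what has and has not been collected for a given system.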

Imagine the time and money that could be saved if developers, analysts, architects and project managers had access to such a vast collection of data. Why do I say that this is a vast data landscape? Surely I have only mentioned 400 Apache projects so far?

Open Source Organisations And Beyond

There are many more open source organizations than just Apache, some of which deal with software and many of which are supporting organizations. As the rough list below shows, there are many open source software organizations providing many products.


In reality an integrated stack may contain open source and non open source components. It might contain cloud based and non cloud based services as well as third party code. Many products and systems provided by different organizations might have relationships with or dependencies upon each other. For instance, Apache Spark can run on both Mesos and Kubernetes. As another example, a project might use the open source MapR Hadoop based stack but then also use the commercial product Qlik (Attunity) Replicate to move data from an RDBMS to HDFS.

There is also a vast array of open source and commercial projects and companies available. Imagine all of the projects supported by open source organizations and all of the *.io and *.ai web based products (h2o.ai being one example). Add in all of the independent organizations, both open source and commercial. The possibilities and system related data begin to scale out into a large repository.

It would be very useful to have a single repository of information for all of these software systems, be they open source or commercial, containing all of the attributes listed here and more. That way, choosing system stack components for the future would not need to rely on suggestions, the knowledge of friends or Google searches.

It is just a thought that has been bugging me as I widen my search across the open source domain and try to gain a greater understanding of what systems are available and what they can do. Perhaps you would care to comment?
