Tuesday, October 17, 2017

THE DBDEBUNK DICTIONARY OF DATA FUNDAMENTALS

Updated 10/18/2017.
 

Given the ample misuse and abuse of terminology, a rigorous and comprehensive data fundamentals dictionary is long overdue. I have tentatively committed to one that is consistent with Codd's true RDM and its McGoveran interpretation -- as distinct from what passes for it in the industry -- to include (a) informal conceptual terms used properly and consistently and (b) accurate formal logical terms. The project will have two phases:

1. Expansion of this blog's search beyond the current Blogger label limitations;
2. Addition of term definitions and publication of a full-fledged desk dictionary.



Phase I: Search Improvement


Since its inception, Google's Blogger -- the platform for this blog -- has had a 200-character limit on the set of labels a post can be tagged with. This significantly constrains the number of labels per post. Moreover, there are many more fundamental terms than can practically be included in the label list.

Having looked for and failed to find a widget or programmatic solution, I see only one way around these limitations: (a) use acronyms and abbreviations for some of the labels and (b) search non-label terms using the Blogger search feature.


  • A TERMINOLOGY page listing fundamental terms, some with acronyms, will be added to the site's top menu (this will be the first, online component of the dictionary). Terms that are labels, or that have acronyms as labels, will be bold. The page will also contain a list of abbreviations for terms that are labels but lack sensible acronyms (see below).
  • Any reference in any post to a fundamental term will also include the acronym, if any (e.g., logical-physical confusion (LPC)). 
Because some acronyms may not be obvious and there is no way in Blogger to document what they stand for in the label list, some searches will be multi-step. The process will work as follows:

  • Is the term you want to search by, or its acronym -- if you recognize it -- on the label list?
      • If yes, search by it.
      • If not, check the TERMINOLOGY page. Is your term on it?
          • If yes, does it have an acronym label?
              • If yes, do a label search;
              • If not, do a full term Blogger search.
          • If not, contact me via email to determine whether it should be added to the list.
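
The steps above can be sketched as a small decision function. This is only an illustration, not something Blogger provides: the names LABELS, TERMINOLOGY and search_plan are hypothetical, and the two entries are the examples used in this post.

```python
# Hypothetical stand-ins for the blog's label list and TERMINOLOGY page.
LABELS = {"LPC"}  # terms or acronyms that are labels (bold on the page)
TERMINOLOGY = {   # term -> acronym (None if the term has no acronym)
    "logical-physical confusion": "LPC",
    "relation predicate": "RP",
}

def search_plan(query):
    """Return the kind of search the steps above call for."""
    # Step 1: the term or acronym is itself on the label list.
    if query in LABELS:
        return f"label search: {query}"
    # Step 2: look the term (or its acronym) up on the TERMINOLOGY page.
    for term, acronym in TERMINOLOGY.items():
        if query in (term, acronym):
            if acronym in LABELS:
                return f"label search: {acronym}"
            return f"Blogger search: {term}"
    # Step 3: the term is not listed anywhere.
    return "not listed: contact the author by email"

print(search_plan("logical-physical confusion"))
print(search_plan("RP"))
```

A label search returns the posts I tagged, while a Blogger search returns whatever the Blogger algorithm matches, so the two branches are not interchangeable.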

Note very carefully, however:

  • For label searches the results are determined by my choice of labels, based on my judgment of relevance/significance (I may assign a label to a post even if the term is not referred to explicitly in the text, if I deem it implicitly significant). So you will end up with results "curated" (so to speak) by me.
  • For Blogger searches, subject to how the Blogger algorithm works, you may end up with all the posts with explicit references to a term/acronym, regardless of its significance.
While the Blogger search option was always there, now there is also the correct terminology to guide searches and serve as a learning resource, another idea behind this project.

Examples may help. Say you are looking for posts about logical-physical confusion, the acronym for which, LPC, is on the label list. If you know what LPC stands for, you can search by it. If you don't,

(1) Go to the TERMINOLOGY page 
(2) Look for your term listed with the acronym LPC
(3) Do an LPC label search.

Suppose you're looking for posts about relation predicate. Neither it nor its acronym, RP, is on the label list.

(1) Go to the TERMINOLOGY page 
(2) They are both listed, but not bold
(3) Do a full term Blogger search by either the term, or its acronym.
 

Some terms that should be labels but lack sensible acronyms are further shortened, per the abbreviations mentioned above (e.g., rel = relational, log = logical, dt = data and so on).
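
To illustrate how the abbreviations help with the limit, here is a minimal sketch. It uses only the three abbreviations given above. The example labels are hypothetical, and so is the assumption that the 200 characters are counted over the labels joined by commas; I have not verified how Blogger counts them.

```python
LABEL_LIMIT = 200  # Blogger's limit on a post's labels, as described in this post

# The abbreviations mentioned above, applied word by word.
ABBREVIATIONS = {"relational": "rel", "logical": "log", "data": "dt"}

def shorten(label):
    """Replace each word that has an abbreviation; leave other words alone."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in label.split())

def fits(labels, limit=LABEL_LIMIT):
    """Assumption: the limit applies to the labels joined by ', '."""
    return len(", ".join(labels)) <= limit

# Hypothetical labels, for illustration only.
labels = ["relational data model", "logical data independence"]
short = [shorten(label) for label in labels]
print(short, fits(short))
```

Each shortened label frees a few characters, which adds up when a post needs many labels.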

It's a bit cumbersome, I know, but it's the best that can be done given Blogger's limitations, and the expanded search guided by the dictionary justifies the inconvenience.

A TERMINOLOGY draft page has been added to the site's top menu. Please check it out and provide feedback -- opinions, suggestions, corrections, ideas are all welcome -- via email. This is an opportunity to test your knowledge of fundamentals against their corruption in the industry.

After the page is finalized, the label list will be revised to be consistent with it.
All forthcoming and possibly some of the most recent posts will also abide by the described system. Time permitting, I may go back and gradually revise older posts.


Phase II: Full-Fledged Desk Dictionary


After the above system is implemented and works, the intention is, time permitting, to add definitions to all terms and publish THE DBDEBUNK DICTIONARY OF DATA FUNDAMENTALS - A DESK REFERENCE FOR THE THINKING DATA PROFESSIONAL AND USER, similar to THE DBDEBUNK GUIDE TO MISCONCEPTIONS.



Monday, October 9, 2017

This Week


1. Database Truth of the Week

“A DBMS using the RDM for all its functionality would be very limited. The RDM only requires that the declarative data sub-language employed by users for data manipulation -- has power not more expressive than first order predicate logic (FOPL), which implies acceptance of certain limitations on what users can do directly in the language, in return for
  • Language declarativity and decidability;
  • Semantic correctness and system-guaranteed logical validity;
  • Physical and logical independence;
  • Simplicity.”
                                                  --David McGoveran


2. What's Wrong With This Database Picture?

"The term database design can be used to describe many different parts of the design of an overall database system. Principally, and most correctly, it can be thought of as the logical design of the base data structures used to store the data. In the relational model these are the tables and views. In an object database the entities and relationships map directly to object classes and named relationships. However, the term database design could also be used to apply to the overall process of designing, not just the base data structures, but also the forms and queries used as part of the overall database application within the database management system(DBMS).

The process of doing database design generally consists of a number of steps which will be carried out by the database designer. Usually, the designer must:

  • Determine the data to be stored in the database.
  • Determine the relationships between the different data elements.
  • Superimpose a logical structure upon the data on the basis of these relationships.
Within the relational model the final step above can generally be broken down into two further steps, that of determining the grouping of information within the system, generally determining what are the basic objects about which information is being stored, and then determining the relationships between these groups of information, or objects." 
                             --Halil Lacevic, What is a Relational Database?, Quora.com

Monday, October 2, 2017

Understanding the Division of Labor between Analytics Applications and DBMS

 My October post @All Analytics

"I am coming across, on the one hand, instructions on how to do "analytics with SQL" and, on the other, tools purporting to enable "analytics without SQL." They are an umpteenth iteration of essentially similar ideas during my 30-plus years in data management and reflect common and entrenched fundamental misconceptions that I have documented and analyzed the costly consequences of in my writings and teachings. They will keep repeating, inhibiting genuine progress, as long as data fundamentals are ignored or dismissed. One of the least understood is the distinction between DBMS and application functions."

Read it all.



 

Sunday, October 1, 2017

Class, Type, Relation and Domain in Database Management

This is a 10/01/17 re-rewrite of a 08/12/12 post revised on 12/05/16 to bring it in line with David McGoveran's formal exposition and interpretation[1] of Codd's RDM (as distinct from its common "understanding" in the industry).

Here's what's wrong with last week's picture, namely:

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
  • type = domain: I fully understand why one might equate type and domain, but ... in today's programming practice, type and domain are quite different. The word type is largely tied to system-level (or physical-level) definitions of data, while a domain is thought of as an abstract set of acceptable values.
  • class != relvar: In simple terms, the word class applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a relation is a class (and trivially also a set), which contributes to confusion.
In modern programming parlance class is generally distinguished from type only in that type refers to primitive (system-defined) data definitions while class refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, type and class are actually synonymous."
There is, indeed, a huge mess. And, as always, it is rooted in poor foundation knowledge[2], to which the comment itself is not immune. 

Friday, September 22, 2017

This Week

1. Database Truth of the Week

“If the data sub-language ... has the power of second order predicate logic (SOPL), expressions are possible that cannot be evaluated (for example, self-referencing expressions) and the formal language is then undecidable, an algorithm to implement a declarative query language is impossible and all hope of physical independence is lost.” --David McGoveran


2. What's Wrong With This Database Picture?

"Our terminology is broken beyond repair. [Let me] point out some problems with Date's use of terminology, specifically in two cases.
"type" = "domain": I fully understand why one might equate "type" and "domain", but ... in today's programming practice, "type" and "domain" are quite different. The word "type" is largely tied to system-level (or "physical"-level) definitions of data, while a "domain" is thought of as an abstract set of acceptable values.

"class" != "relvar": In simple terms, the word "class" applies to a collection of values allowed by a predicate, regardless of whether such a collection could actually exist. Every set has a corresponding class, although a class may have no corresponding set ... in mathematical logic, a "relation" is a "class" (and trivially also a "set"), which contributes to confusion.
In modern programming parlance "class" is generally distinguished from "type" only in that "type" refers to "primitive" (system-defined) data definitions while "class" refers to higher-level (user-defined) data definitions. This distinction is almost arbitrary, and in some contexts, "type" and "class" are actually synonymous."

Sunday, September 17, 2017

Database Management: No Progress Without Data Fundamentals

I have recently -- yet again -- been accused in a LinkedIn exchange of "gibberish without any evidence" and of claiming that "nobody know what they're doing" with databases. I will leave it to readers to judge whether (1) five decades' worth of writings and teaching is "no evidence" and (2) my comments in the exchange are gibberish. Here I would like to dare anybody to find claims to that effect in any of my pronouncements. What I did, do and will say is that most data professionals do not know and understand data and relational fundamentals -- an incontrovertible fact proved not just by me[1], but also by others[2,3] -- and that this inhibits real progress in database management.

As I wrote two weeks ago:
"The RDM put database management on a formal, scientific foot. Consequently, tool experience and relational terminology are insufficient -- foundation knowledge is necessary. Unfortunately, most data professionals do not possess it, in part because they have been misled by the industry and in part because few go through an education -- as distinct from training -- program that teaches the RDM and teaches it correctly. Consequently, even those with the heart in the right place defend the RDM without a full understanding, their views distorted by what passes for it (stay tuned for a debunking of such a recent example)."
I will now fulfill the promise by debunking just such a "heart-in-the-right-place" defense of the RDM. 

Sunday, September 10, 2017

This Week

1. Database Truth of the Week

“A network is a directed acyclic graph (the "direction" of the transitive relationship) and, thus, amenable to transitive closure (TC). In the Relational Data Model (RDM) that usually means the smallest set that includes all the members that satisfy the transitive relationship in question (for the count of each object type the closure is computed and the count ignores level). While the Relational Data Model (RDM) can handle an important subset of graph theory via special graph domain operators and extensions to the original relational operators, which could be made efficient, it is a very difficult problem. Certain computations on finite sets such as TC are not in general computable in a language based on first order predicate logic (FOPL) that is declarative, decidable and supports physical independence (PI) -- a core relational objective. They require a computationally complete language (CCL) that is imperative and recursive.
A ‘TC function’ can be implemented using a host CCL that returns its result in the form of a relation; then a symbol (i.e., pure syntax) of type relation can be defined in relational algebra that references/invokes that function. From within the algebra it appears to be just a relation and is up to the user to understand what the value of the returned relation means --i.e., that it represents the TC. That understanding/interpretation is outside the algebra and passed to users only via documentation (e.g., some meta-language).” --David McGoveran
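
To illustrate the 'TC function' idea in the quote, here is a minimal sketch in Python, standing in for the host CCL (my choice; the quote names no language). The relation is modeled as a set of (parent, child) pairs, and the part names are hypothetical. The closure is computed by an imperative loop that repeats until nothing new is derived, and the result is again a set of pairs, i.e., it looks like just another relation.

```python
def transitive_closure(pairs):
    """Return the smallest transitive relation that contains `pairs`."""
    closure = set(pairs)
    while True:
        # Join the relation with itself on the shared middle element.
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:
            # Nothing new was derived: the closure is complete.
            return frozenset(closure)
        closure |= derived

# Hypothetical "contains" relation: engine -> piston -> ring.
contains = {("engine", "piston"), ("piston", "ring")}
for pair in sorted(transitive_closure(contains)):
    print(pair)
```

Nothing in the returned set says that it represents a transitive closure; as the quote notes, that interpretation has to come from documentation outside the algebra.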


2. What's Wrong With This Database Picture?

"I don’t like talking about the relational theory of data. It is absolutely fundamental to any deep understanding of data, but most practitioners get along fine without it. It’s more the implementers of database management systems (DBMSs) who need to understand relational theory, so teaching relational theory to ordinary practitioners is a bit like tormenting people with irrelevant theory before you let them get on with the business at hand. Moreover, some of those who understand relational theory use their knowledge to beat other people over the head with it. I don’t want to be associated with that high-handed approach to this important theory.

But I’ve been goaded. Google made me do it. My attention was drawn to a video put out by some folks at Google, Data Modeling for BigQuery. The video is fine for the most part, but it makes some misstatements about relational theory that just drive me crazy. They repeat commonly accepted misconceptions about relational databases—misconceptions that, unfortunately, have driven some of the “advances” we’ve seen of late in the realm of database technology. There have definitely been some true advances, but some new technology is merely different without being better.
If you’re a practitioner, designing, implementing, and using databases, whether SQL or NoSQL, this won’t matter much to you, although it never hurts to learn a little more about the theory of data. However, if you are a programmer who might be the one who builds the next NoSQL mega-star that will replace decades-old technology, you need to know this, because this knowledge will enable you to blind-side every established DBMS vendor, whether SQL or NoSQL." --Ted Hills, Understand Relational to Understand the Secrets of Data