Friday, August 9, 2019

Data Sublanguage Part 2: Data Definition


Note:  For consistency with this series of four posts, I have revised the corresponding section of paper #2 in the Understanding the Real RDM series, Logical Access, Data Sublanguage, Kinds of Relations, Database Redundancy, and Consistency, available for ordering from the PAPERS page.
Please support this site and keep it free by regular donations via the HOME page and/or purchasing papers and BOOKS. Thanks.


Per Part 1, a data manipulation sublanguage (DMsL) is the relationally complete part of a computationally complete host programming language (CCL) tasked strictly with data manipulation (data retrievals and updates).

“Let us denote the data sublanguage by R and the host language by H. R permits the declaration of relations and their domains [and] the specification for retrieval of any subset of data from the data bank ... A set so specified may be fetched for query purposes only, or it may be held for possible changes ... H permits supporting declarations which indicate, perhaps less permanently, how these relations are represented in storage.”[1,2]

  • The retrieval part ("fetched for query purposes only") is a syntactic concretization of the relational algebra (RA) -- Codd's application of simple set theory (SST) expressible by a subset of first order predicate logic (FOPL) -- in effect, the linguistic expression of the RDM: relational operations (manipulation) on semantically constrained (integrity) relations (structure)[3];
  • The update part ("held for possible changes") -- a change properly understood as relation transformation, not replacement -- can be expressed, albeit with some convolution, in RA form. In other words, both retrievals and updates derive relations from relations;
  • Relations are in normal form (i.e., no relation-valued domains (RVDs) and, thus, no sets of sets).

Note: There is a set theory understanding of updating that avoids the (very anti-set-theoretic) "time-varying relations" gloss: an update does not replace a relation (the conventional understanding in the industry), but derives another relation that bears a relationship to it[4].
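A minimal Python sketch of this set-theoretic view (the relation and attribute names are hypothetical, purely for illustration): the update does not modify the relation in place, it derives a new relation from the old one.

from typing import NamedTuple

class Emp(NamedTuple):
    emp_id: int
    name: str
    salary: int

emp = frozenset({Emp(100, "Jones", 50000), Emp(200, "Smith", 60000)})

# "Give employee 100 a salary of 55000," expressed as derivation rather
# than in-place modification: restrict away the old tuple, union in the
# transformed one.
emp_after = frozenset(t for t in emp if t.emp_id != 100) | frozenset(
    t._replace(salary=55000) for t in emp if t.emp_id == 100)

assert emp_after != emp   # a new relation was derived ...
assert len(emp) == 2      # ... and the original relation is unchanged

Note that emp_after is computable from emp by restriction and union alone, which is the sense in which both retrievals and updates derive relations from relations.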

Monday, July 22, 2019

Data Sublanguage Part 1: Relational vs. Computational Completeness





Revised 7/24/2019.
“Recently I have read that SQL is actually a data sublanguage and not a programming language like C++ or Java or C# ... The answers ... have the pattern of "No, it is not. Because it's not Turing complete.", etc, etc. ... I am a bit confused, because since you can develop things through SQL, I thought it is similar to other programming languages ... I am curious about knowing why exactly is SQL not a programming language? Which features does it lack? (I know it can't do loops, but what else more?)”
--StackOverflow.com

“The SQL operators were meant to implement the relational algebra as proposed by Dr. Ted Codd. Unfortunately Dr. Codd based some of his ideas on a "extended set theory", which was an idea formulated and described in a 1977 paper by D. L. Childs ... But Childs’ extensions were not ideally suited, which is explained in quite some detail in [a] book ... by Professor Gary Sherman & Robin Bloor [who] argue that mainstream Zermelo-Fraenkel set theory (Cantor), would have been a better starting point. One key issue is that sets should be able to be sets of sets.”
--Dataversity.net

Lack of foundation knowledge and of familiarity with the history of the field characterizes not just average data practitioners (as in the first quote), but also "experts" (as in the second quote), and both are unaware of the consequences. The concept of a data sublanguage exists precisely because the claim in the second quote is false.
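To make the sublanguage/host division concrete, a minimal Python sketch (hypothetical table and data, with SQLite standing in for the DBMS): SQL plays the set-level data sublanguage R, while the computationally complete host language H supplies the loops the first quote asks about.

import sqlite3

# Host (H): Python, computationally complete.
# Sublanguage (R): SQL, declarative and set-level.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees "
             "(emp_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(100, "Jones", 50000), (200, "Smith", 60000)])

# R specifies WHAT subset of the data is wanted -- no loops, no branches ...
rows = conn.execute(
    "SELECT name, salary FROM employees WHERE salary > 55000").fetchall()

# ... while iteration, conditionals, and general computation stay in H.
for name, salary in rows:
    print(name, salary)

The sublanguage does not need to be computationally complete, because the host is; that division of labor is the whole point of the concept.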

Friday, June 21, 2019

Data Meaning and Mining: Knowledge Representation and Discovery


Note: This is a re-write -- prompted by a LinkedIn exchange -- of two columns I published @All Analytics.
“Scientific research experiments that require assignment of data to tables, which is difficult when the scientists do not know ahead of time what analysis to run on the data, a lack of knowledge that severely limits the usefulness of relational [read: SQL] databases.”
NoSQL systems are recommended in such cases. But what does "scientists do not know ahead of time what analysis to run" really mean?

Data, Information, and Knowledge


One way to view the difference between data, information, and knowledge is:
“1. Data: Categorized sequences of values representing some properties of interest, but if and how they are related is unknown (e.g., research variables in scientific experiments);
2. Information: Properties further organized in named combinations -- "objects", but how they are related is unknown (e.g., "runs", or "cases" in scientific experiments);
3. Knowledge: Relationships among properties and among objects of different types are known.”

--David McGoveran
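A minimal Python sketch of the three levels (the variable names and values are hypothetical, purely for illustration):

# 1. Data: values of properties of interest; how they relate is unknown.
temperatures = [98.6, 99.1, 97.9]
dosages = [10, 20, 10]

# 2. Information: properties organized into named combinations
#    ("objects"), e.g. one "case" per experimental run.
cases = [{"case": "R1", "temp": 98.6, "dose": 10},
         {"case": "R2", "temp": 99.1, "dose": 20}]

# 3. Knowledge: relationships among properties and among object types are
#    known and statable as constraints, e.g. every case refers to a
#    registered subject.
subjects = {"S1", "S2"}
case_subject = {"R1": "S1", "R2": "S2"}
assert all(s in subjects for s in case_subject.values())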


Friday, June 14, 2019

Normalization and Further Normalization Part 3: Understanding Database Design


Note: This is a re-write of two older posts, to bring them into line with McGoveran's formalization, re-interpretation, and extension[1] of Codd's RDM.
 

In Part 1 we explained that for a database to be relational, database design must adhere to three core principles, in which case it consists of relations that are by definition in both 1NF and 5NF. In Part 2 we showed that whether tables visualize relations (i.e., are R-tables) can be determined only with reference to the conceptual model that the database designer intended the database to represent (not what any users might think it does). This is obscured by the common and entrenched confusion/conflation of levels of representation and, consequently, of types of model -- conceptual, logical, physical, and data model -- that we have so often debunked[2].


Sunday, June 2, 2019

Normalization and Further Normalization Part 2: If You Need Them, You're Doing It Wrong


In Part 1 we outlined some fundamentals of database design, namely the distinction between normalization to 1NF and further normalization (to "full" 5NF), and explained that they are necessary only to repair poor designs -- if you (1) develop a complete conceptual model, (2) formalize it properly using the RDM, and (3) adhere to the three core principles of database design, you should end up with a relational database that is in both 1NF and 5NF.

Here we apply this knowledge to the typical request for "normalization" help we presented in Part 1.
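As a minimal illustration of the 1NF case (hypothetical data, with Python sets standing in for relations): a relation-valued attribute is a set of sets, and normalization to 1NF repairs it by reducing every attribute value to an atomic one.

# A design with a relation-valued attribute (a set of sets) ...
orders_non1nf = {("C1", frozenset({"P1", "P2"})),
                 ("C2", frozenset({"P1"}))}

# ... repaired by normalization to 1NF: one tuple per atomic fact.
orders_1nf = {(c, p) for (c, ps) in orders_non1nf for p in ps}
# {("C1", "P1"), ("C1", "P2"), ("C2", "P1")}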

Friday, May 31, 2019

Normalization and Further Normalization Part 1: Databases Representing ... What?


Note: This is a re-write of older posts (which now link here), to bring them into line with the McGoveran formalization, re-interpretation, and extension[1] of Codd's RDM.
“A particular bug-bear and a mistake that +90% of "data modelers" make, is analyzing "point in time" views of the business data and "normalizing" those values hence failing to consider change over time and the need to reproduce historic viewpoints. Let’s say we start with this list of data-items for a Sales-Invoice (completely omitting details of what’s been sold):
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-ID,
  Customer Name,
  Invoice-Address-Line-1,
  Invoice-Address-Line-2,
  Invoice-Address-Line-3,
  Invoice-Address-Line-4,
  Invoice-Address-Postcode,
  Net-Amount,
  VAT,
  Total-Amount
 };
Nearly every time, through the blind application of normalization we get this ... there’s even a term for it -- it’s called "over-normalization":
SALES-INVOICE
 {Invoice-Date,
  Customer-Account-Id
   REFERENCES Customer-Account,
  Net-Amount,
  VAT,
  Total-Amount
 };

CUSTOMER-ACCOUNT
 {Customer-Account-Id,
  Customer-Name,
  Invoice-Address
   REFERENCES Address
 };

ADDRESS
 {Address-Line-1,
  Address-Line-2,
  Address-Line-3,
  Address-Line-4,
  Postcode
 };”
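To make the quoted concern concrete, a minimal Python sketch with hypothetical data: when the invoice address is held only in CUSTOMER-ACCOUNT, a later address change retroactively alters what historic invoices appear to say.

customer_account = {"C1": {"name": "Acme", "invoice_address": "1 Old Road"}}
sales_invoice = {"INV-1": {"date": "2018-01-15", "customer": "C1",
                           "net": 100, "vat": 20, "total": 120}}

# The customer later moves ...
customer_account["C1"]["invoice_address"] = "2 New Street"

# ... and reproducing the 2018 invoice now shows the new address,
# not the address actually invoiced at the time.
inv = sales_invoice["INV-1"]
print(customer_account[inv["customer"]]["invoice_address"])  # 2 New Street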
One measure of the scarcity of foundation knowledge in the industry is that attempts to correct a plethora of common misconceptions[2] often suffer from the very misconceptions they aim to correct. One of the most common fallacies is confusion of levels of representation[3], which takes two forms[4]. We have written extensively about the logical-physical confusion (LPC)[5,6,7,8] underlying "denormalization for performance"[9], and about the conceptual-logical conflation (CLC) that lumps conceptual modeling together with data modeling[10,11,12], inhibiting understanding that the latter is a formalization of the former.