First things first, a couple of definitions to clear up:
- ETL: Extract, Transform, Load. Transforms data before loading it into the target system
- extract raw data from various sources
- use a secondary processing server to transform that data
- load that data into a target database
- ELT: Extra, Load, Transform. Transforms data after loading it into the target system
- extract raw data from various sources
- load that data into a target database
- transform that data within the data system itself
Because there are often differences in the manner in which data is stored between database systems, migrating the data between these DBMS can be challenging. Therefore, in such cases, one can minimize efforts refactoring the application and instead perform ELT operations on the data itself in a dynamic fashion during the migration phase.
One of the more problematic types is date and timestamps.
There are three techniques that come to mind when migrating DATE datatypes between two RDBMS from a CSV file:
- Perform an ELT transformation, such as substituting the empty string for a NULL value, using a trigger while it loads into the table with the column of type “DATE”.
- Craft a custom PostgreSQL command that dynamically performs the ELT operation by transforming the empty string for a NULL value and inserting the data into a standard column of type “DATE” as it is copied.
- Create a custom datatype, thus permitting the direct loading of empty strings into a custom DATE column equivalent.
The first and second method migrates the data directly into standard postgres DATE columns. This method does assume, though, that empty strings will not be inserted into the database by the application in future transactions, post-migration.
Although the third method is the simplest and most straightforward method of migrating the data, some thought of administering the data post-migration may be required as well as it limits the DATE administration functionality available in PostgreSQL.
For our purposes, the third method, the use of custom datatypes and operators, is documented herein.
The steps are:
Step 1: Create function f_ex
Notice the commented RETURN values “null”, “-infinity”, “infinity”, “epoch” etc. Although this POC defaults returning NULL, investigating these other DATE constants can provide further return options:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
create or replace function f_ex(in x varchar) returns date as $$ begin raise notice 'entering function f_ex, the argument is "%" ',x; return null; -- return '-infinity'::date; -- return 'infinity'::date; -- return 'epoch'::date; end; $$ language plpgsql; |
Step 2: Create CAST
Using the aforementioned function, a CAST is created in the database and notice that it is defined as type “DATE”. Even in this current form, without using the DOMAIN, one creates a column storing this value however, it would have to be declared as a character datatype:
1 2 3 |
create cast (varchar as date) with function f_ex(varchar) as assignment; |
Step 3: Create DOMAIN
The DOMAIN coerces the inserted value to be saved in the table as a specific datatype, which has its own specific set of behaviors.
Two variations are demonstrated in order to highlight the different behaviors that can be introduced:
- Variation 1:
1 |
create domain uk_date1 as date; |
- Variation 2:
1 |
create domain uk_date2 as varchar; |
Validation/POC
Using the aforementioned CAST and DOMAIN, a table can now be created and populated with data.
The steps are:
- Create an example table
- Insert records into the example table
- Query the table
Step 1: Create table
Additional columns are declared in the table in order to demonstrate administering a stored value. See what it looks like:
1 2 3 4 5 6 |
create table example_table( id serial primary key ,date_field1 uk_date1 ,date_field2 uk_date2 ,date_default date ); |
ATTENTION: This SQL statement ensures that UK-style date inputs are implemented in the database.
1 |
alter database db01 set datestyle to 'SQL, DMY'; |
Step 2: Insert records
Notice there are not only are multiple date formats accepted, but both empty strings and NULLs can be successfully inserted:
1 2 3 4 |
INSERT INTO example_table (date_field1,date_field2,date_default) VALUES ('24/10/2024','24/10/2024','24/10/2024'); INSERT INTO example_table (date_field1,date_field2,date_default) VALUES ('24.10.2024','24.10.2024','24.10.2024'); INSERT INTO example_table (date_field1,date_field2,date_default) VALUES (now()::date,now()::date,now()::date); INSERT INTO example_table (date_field1,date_field2,date_default) VALUES (NULL,NULL,NULL); |
Step 3: Query
Example 1
Note the blank space generated by the NULL on row 4:
1 |
SELECT * FROM example_table; |
1 2 3 4 5 6 7 8 9 |
-- -- f_ex returns NULL -- id | date_field1 | date_field2 | date_default ----+-------------+-------------+-------------- 1 | 24/10/2024 | 24/10/2024 | 24/10/2024 2 | 24.10.2024 | 24/10/2024 | 24/10/2024 3 | 15/11/2024 | 15/11/2024 | 15/11/2024 4 | | | |
Example 2
Testing the use of empty spaces using the two different DOMAINs.
Variation 1: DOMAIN “uk_date1″… Type VARCHAR
Observe that Column date_field1 is coerced to a datatype VARCHAR by the DOMAIN uk_date1 while columns date_field2, date_default are CAST from type VARCHAR to DATE:
1 2 |
INSERT INTO example_table (date_field1,date_field2,date_default) VALUES ('',''::varchar,''::varchar); |
1 2 3 4 5 6 7 |
-- empty string "" is excluded because it can't be cast with this query select date_field1 as datatype_varchar ,(date_field1::text::date + '1 month'::interval)::date as date_field1_1month , date_default ,(date_default + '1 month'::interval)::date as date_default_interval from example_table where date_field1 != ''; |
Here is the resulting output; note the NOTICE comments raised in the body of function f_ex:
1 2 3 4 5 6 7 8 9 10 |
psql:go2_validate.sql:39: NOTICE: entering function f_ex, the argument is "" psql:go2_validate.sql:39: NOTICE: entering function f_ex, the argument is "" INSERT 0 1 datatype_varchar | date_field1_1month | date_default | date_default_interval ------------------+--------------------+--------------+----------------------- 24/10/2024 | 24/11/2024 | 24/10/2024 | 24/11/2024 24.10.2024 | 24/11/2024 | 24/10/2024 | 24/11/2024 15/11/2024 | 15/12/2024 | 15/11/2024 | 15/12/2024 (3 rows) |
Variation 2: DOMAIN “uk_date2” … type DATE
In this variation, column “date_field2” is coerced to type DATE. Notice in this case how both empty string and NULL are returned by this query:
1 2 3 4 5 |
select date_field2 as datatype_date ,(date_field2::text::date+'1 month'::interval)::date as date_field2_1month , date_default ,(date_default + '1 month'::interval)::date as date_default_1month from example_table; |
1 2 3 4 5 6 7 8 |
datatype_date | date_field2_1month | date_default | date_default_1month ---------------+--------------------+--------------+--------------------- 24/10/2024 | 24/11/2024 | 24/10/2024 | 24/11/2024 24/10/2024 | 24/11/2024 | 24/10/2024 | 24/11/2024 15/11/2024 | 15/12/2024 | 15/11/2024 | 15/12/2024 | | | | | | (5 rows) |
Caveat
Beware, function f_ex allows the insertion of any character string, which, of course, cannot conform to the PostgreSQL DATE type implementation. This can, however be mitigated by adding the relevant conditionals in the function call, as demonstrated by the following example;
1 2 |
insert into example_table (date_field1,date_field2,date_default) values ('abc','abc'::varchar,'abc'::varchar); |
1 2 3 4 |
select date_field1 as datatype_varchar ,date_field2 as datatype_date ,date_default from example_table; |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
-- output psql:go2_validate.sql:67: NOTICE: entering function f_ex, the argument is "abc" psql:go2_validate.sql:67: NOTICE: entering function f_ex, the argument is "abc" INSERT 0 1 datatype_varchar | datatype_date | date_default -----------------+---------------+-------------- 24/10/2024 | 24/10/2024 | 24/10/2024 24.10.2024 | 24/10/2024 | 24/10/2024 15/11/2024 | 15/11/2024 | 15/11/2024 | | | | abc | | (6 rows) |
Conclusion
While the importation of data that includes empty strings by using a CAST and storing it as a DOMAIN is fairly straightforward, some level of consideration should be given to how one intends to manipulate the data once it’s there. At the end of the day, unless the application logic and querying are straightforward, one may have to either develop a more sophisticated set of queries or finally migrate to an altogether traditional table schema using standard data types.
Remember, it’s important to consider not only how the data gets to the database but also how it will be used.
Our PostgreSQL Performance Tuning eBook condenses years of database expertise into a practical guide for optimizing your PostgreSQL databases. Inside, you’ll discover our most effective PostgreSQL performance strategies derived from real-world experience.
“empty strings will not be inserted into the database by the application in future transactions”
I may miss the obvious, but how should this happen for a date column?
Hi,
> how should this happen for a date column?
Yes, I’ve seen this! It’s really all a matter of the source database system. Work long enough in this industry and you’ll find there’s always an exception to what is otherwise a very commonsense rule that to follow.