PostgreSQL



Oracle to Postgres migration :

https://www.enterprisedb.com/blog/the-complete-oracle-to-postgresql-migration-guide-tutorial-move-convert-database-oracle-alternative



  •  Installation steps for a PostgreSQL cluster on an Ubuntu machine.


1. Create the file /etc/apt/sources.list.d/pgdg.list and add a line for the repository

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'

2. Import the repository signing key, and update the package lists


wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -


sudo apt-get update


3. Install PostgreSQL 11 with apt-get (or another apt-driving tool). Ubuntu's own repositories also include PostgreSQL, but the PGDG repository added above carries current versions:



sudo apt-get install postgresql-11



  • Important commands : 
1. Start the PostgreSQL 11 cluster named "main"

sudo pg_ctlcluster 11 main start



Important Notes : 



  • If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT is sometimes called a transaction block.
  • By default huge pages are off in AWS RDS /Aurora.
  • maintenance_work_mem : Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY.
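For example, the two updates below either both commit or both roll back; without the explicit BEGIN, each would commit independently (table and amounts are illustrative):

```sql
BEGIN;
-- Both statements are part of one atomic transaction block.
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;   -- or ROLLBACK; to discard both updates
```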

Partitioning : 

Range Partitioning
The table is partitioned into ranges defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
List Partitioning
The table is partitioned by explicitly listing which key values appear in each partition.
Hash Partitioning
The table is partitioned by specifying a modulus and a remainder for each partition. Each partition will hold the rows for which the hash value of the partition key divided by the specified modulus will produce the specified remainder.
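A minimal sketch of the declarative syntax (PostgreSQL 10+ for range/list, 11+ for hash; the measurement table and its bounds are hypothetical):

```sql
CREATE TABLE measurement (
    city_id  int  NOT NULL,
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

-- Each partition holds a non-overlapping range of the key.
CREATE TABLE measurement_y2023 PARTITION OF measurement
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE measurement_y2024 PARTITION OF measurement
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- List and hash partitions use the same pattern with different bounds:
--   PARTITION BY LIST (region)  ... FOR VALUES IN ('emea', 'apac')
--   PARTITION BY HASH (city_id) ... FOR VALUES WITH (MODULUS 4, REMAINDER 0)
```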


Foreign Data : 

PostgreSQL allows you to access data that resides outside PostgreSQL using regular SQL queries, through foreign data wrappers.
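A sketch using the postgres_fdw extension; the server name, host, credentials, and remote table are all placeholders:

```sql
CREATE EXTENSION postgres_fdw;

-- Describe the remote PostgreSQL instance (hypothetical host/database).
CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'sales', port '5432');

CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
    OPTIONS (user 'app', password 'secret');

-- The foreign table can then be queried with regular SQL.
CREATE FOREIGN TABLE remote_orders (id int, total numeric)
    SERVER remote_pg OPTIONS (schema_name 'public', table_name 'orders');

SELECT count(*) FROM remote_orders;
```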



Postgres performance tuning tips : 

1. Install the extension pg_stat_statements to find expensive queries.
2. Avoid heap fetches (keep the visibility map current with VACUUM) & hash batches (raise work_mem so hash joins fit in a single batch).
3. The EXPLAIN ANALYZE command actually executes the statement.
4. default_statistics_target : Sets the default statistics target for table columns without a column-specific target set via ALTER TABLE SET STATISTICS. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner's estimates. The default is 100.
5. Aurora Postgres specific features : 
  • Aurora Query Plan Management (QPM) 
  • Cluster cache management 
  • Aurora Global database.
  • Aurora serverless
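As a sketch of tip 1 above, once pg_stat_statements is loaded via shared_preload_libraries you can rank queries by cumulative execution time (the column is total_exec_time on PostgreSQL 13+, total_time on older releases):

```sql
CREATE EXTENSION pg_stat_statements;

-- Top 5 statements by cumulative execution time.
SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;
```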
      



Postgres (Other topics ) : 

1. What is pg_toast ?


In PostgreSQL, pg_toast refers to a system schema used to store large column values that exceed a threshold size (roughly 2 kB by default). It holds "TOASTed" data for variable-length column types such as text, varchar, bytea, and varbit.

When a table contains values too large to fit within a single database page, PostgreSQL automatically compresses them and/or moves them out of line into the pg_toast schema. This mechanism is TOAST ("The Oversized-Attribute Storage Technique"), informally known as "toasting." Its main purpose is to optimize storage and performance by keeping main-table rows small.

When a large data value needs to be stored, PostgreSQL stores a small placeholder value in the main table and moves the actual data to the corresponding pg_toast table.
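The TOAST table backing a given relation can be looked up in the catalog (table name illustrative):

```sql
-- reltoastrelid links a table to its pg_toast relation.
SELECT relname, reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'my_table';
```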

    Backup & Recovery :

When performing backup and restore operations on your PostgreSQL database, both the main tables and their associated pg_toast tables are included in the backup and restore process.

      Vacuum : 

The VACUUM process includes handling pg_toast tables and their associated data.
 

https://www.enterprisedb.com/postgres-tutorials/how-tune-postgresql-memory


2. Open-source backup tools - Barman vs pgBackRest.


Choosing between Barman and pgBackRest depends on your specific needs and the complexity of your backup and recovery requirements. If you're looking for simplicity and ease of use, Barman may be a good choice. If you require advanced features (such as parallel and incremental backups or built-in encryption) and performance optimizations, pgBackRest might be a better fit. Evaluate both tools against your environment, backup requirements, and available resources to determine which one aligns better with your needs.


3. wal_level = replica or wal_level = logical ?

In summary, the key difference between wal_level = replica and wal_level = logical is the level of information written to the WAL. wal_level = replica provides the necessary data for physical replication, enabling streaming replication and hot standby setups. On the other hand, wal_level = logical includes additional metadata required for logical decoding, allowing for more flexible replication scenarios where specific tables or transformations need to be replicated.

** Note : wal_level = logical records a superset of what wal_level = replica records, so a source database set to logical can serve physical (streaming) replicas and logical subscribers at the same time, at the cost of somewhat higher WAL volume.

4. How to configure streaming replication ? 

https://www.enterprisedb.com/blog/how-set-streaming-replication-keep-your-postgresql-database-performant-and-date

5. statistics_target parameter. 

The default_statistics_target parameter accepts a value between 1 and 10000; the default is 100. The value controls how detailed the gathered statistics are: ANALYZE keeps up to that many most-common values and histogram buckets per column, and samples roughly 300 × the target rows per table. Increasing the value allows for more accurate planner estimates but also increases the time and resources required to ANALYZE the data.
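A sketch of adjusting the target, per column or session-wide (the names and the value 500 are illustrative):

```sql
-- Richer statistics for one skewed column, then refresh them.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

-- Or raise the default for the current session.
SET default_statistics_target = 500;
```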

6. Difference between EXPLAIN and EXPLAIN ANALYZE.


The EXPLAIN ANALYZE statement combines the functionality of EXPLAIN with the actual execution of the query. When you run EXPLAIN ANALYZE, PostgreSQL not only displays the execution plan but also executes the query and provides real-time performance information. It shows the actual time taken to execute each step, the number of rows processed at each stage, and other relevant statistics.

You can use the auto_explain module to log plans automatically, including for DML, but enabling it for every query (especially with timing) is not recommended in production due to its performance overhead.
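Because EXPLAIN ANALYZE really executes the statement, a data-modifying query is best wrapped in a transaction that is rolled back (table name illustrative):

```sql
BEGIN;
EXPLAIN ANALYZE
UPDATE accounts SET balance = balance * 1.01;
ROLLBACK;  -- the UPDATE ran for timing purposes, but its effects are discarded
```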


7. Why vacuum won't remove dead rows from a table?


    - Check bloated tables by querying pg_stat_all_tables.
    - Verify the problem by running VACUUM (VERBOSE) and looking for dead tuples that are "not yet removable".

Common causes :

  • Long-running transactions that still need the old row versions.

  • Abandoned replication slots holding back the xmin horizon.

  • Orphaned prepared transactions (visible in pg_prepared_xacts).

  • A standby server with hot_standby_feedback = on.
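Each of these causes can be checked with a catalog query; the one-hour threshold below is just an example:

```sql
-- Long-running transactions holding back the xmin horizon.
SELECT pid, state, xact_start
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 hour';

-- Abandoned (inactive) replication slots.
SELECT slot_name, active FROM pg_replication_slots WHERE NOT active;

-- Orphaned prepared transactions.
SELECT gid, prepared FROM pg_prepared_xacts;
```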

8. rule vs trigger : see the rules vs. triggers comparison at the end of these notes.


9. recovery_target_* parameter : 


By default, recovery will recover to the end of the WAL log. The following parameters can be used to specify an earlier stopping point. At most one of recovery_target, recovery_target_lsn, recovery_target_name, recovery_target_time, or recovery_target_xid can be used.

10. Use a PostgreSQL log analyzer such as pgBadger to troubleshoot performance issues.

Ref : https://severalnines.com/blog/postgresql-log-analysis-pgbadger/

11. How to remove table bloat?

First option is VACUUM :

The first thing should be the regular use of the VACUUM command. This removes dead tuples and adds the freed space to the free space map so that it can be reused by the database system. It also tries to remove index entries that point to these dead tuples. However, this space is not returned to the OS and does not shrink the table size.
Normally the autovacuum background process takes care of this. The default settings vacuum a table whenever roughly 20% of its rows have changed. So if you are seeing bloat frequently, make sure that autovacuum is enabled and adjust its settings according to the data size.

Reordering : 
  • VACUUM FULL
  • CLUSTER
  • pg_repack
  • REINDEX
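A sketch of these rewrite-based options (table and index names illustrative; each takes a strong lock except pg_repack, which runs mostly online):

```sql
-- Rewrites the whole table into new files, returning space to the OS,
-- but holds an ACCESS EXCLUSIVE lock for the duration.
VACUUM FULL mytable;

-- Rewrites the table physically ordered by the given index.
CLUSTER mytable USING mytable_pkey;

-- Rebuilds a bloated index from scratch.
REINDEX INDEX mytable_pkey;

-- pg_repack is invoked from the shell, roughly: pg_repack --table mytable mydb
```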

12. GIN vs GiST index ?


In choosing which index type to use, GiST or GIN, consider these performance differences:

  • GIN index lookups are about three times faster than GiST

  • GIN indexes take about three times longer to build than GiST

  • GIN indexes are moderately slower to update than GiST indexes, but about 10 times slower if fast-update support was disabled (see Section 58.4.1 for details)

  • GIN indexes are two-to-three times larger than GiST indexes

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.
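Creating the two index types over full-text data looks like this (table and column names illustrative):

```sql
-- GIN: faster lookups, slower to build/update, best for mostly-static data.
CREATE INDEX idx_docs_gin ON documents
    USING GIN (to_tsvector('english', body));

-- GiST: faster to update, best for frequently changing data.
CREATE INDEX idx_docs_gist ON documents
    USING GIST (to_tsvector('english', body));
```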

13. postgres_fdw vs dblink  ?

While the functionality in the dblink extension is similar to that in the foreign data wrapper, the Postgres foreign data wrapper is more SQL standard compliant and can provide improved performance over dblink connections.
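For contrast, a dblink call opens a connection per query and requires the result's column list to be spelled out (the connection string and table are placeholders):

```sql
CREATE EXTENSION dblink;

SELECT *
FROM dblink('host=remote.example.com dbname=sales user=app',
            'SELECT id, total FROM orders')
     AS t(id int, total numeric);
```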

14. The VACUUM & ANALYZE commands are not replicated in Postgres; each instance performs its own maintenance.


15. Heap fetches & the visibility map.

An index-only scan still has to visit the heap ("heap fetches") for any page not marked all-visible in the visibility map; regular VACUUM keeps the visibility map current and thus keeps heap fetches low.

https://www.postgresql.org/docs/12/runtime-config-resource.html


16. Logical Replication  : 


Logical replication currently has the following restrictions or missing functionality. These might be addressed in future releases.

  • The database schema and DDL commands are not replicated. The initial schema can be copied by hand using pg_dump --schema-only. Subsequent schema changes would need to be kept in sync manually. (Note, however, that there is no need for the schemas to be absolutely the same on both sides.) Logical replication is robust when schema definitions change in a live database: When the schema is changed on the publisher and replicated data starts arriving at the subscriber but does not fit into the table schema, replication will error until the schema is updated. In many cases, intermittent errors can be avoided by applying additive schema changes to the subscriber first.

  • Sequence data is not replicated. The data in serial or identity columns backed by sequences will of course be replicated as part of the table, but the sequence itself would still show the start value on the subscriber. If the subscriber is used as a read-only database, then this should typically not be a problem. If, however, some kind of switchover or failover to the subscriber database is intended, then the sequences would need to be updated to the latest values, either by copying the current data from the publisher (perhaps using pg_dump) or by determining a sufficiently high value from the tables themselves.

  • TRUNCATE commands are not replicated. This can, of course, be worked around by using DELETE instead. To avoid accidental TRUNCATE invocations, you can revoke the TRUNCATE privilege from tables.

  • Large objects are not replicated. There is no workaround for that, other than storing data in normal tables.

  • Replication is only possible from base tables to base tables. That is, the tables on the publication and on the subscription side must be normal tables, not views, materialized views, partition root tables, or foreign tables. In the case of partitions, you can therefore replicate a partition hierarchy one-to-one, but you cannot currently replicate to a differently partitioned setup. Attempts to replicate tables other than base tables will result in an error.
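With those restrictions in mind, a minimal native setup looks like this (names and the connection string are placeholders; the publisher needs wal_level = logical, and the table must already exist on the subscriber):

```sql
-- On the publisher:
CREATE PUBLICATION my_pub FOR TABLE accounts;

-- On the subscriber:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=pub.example.com dbname=app user=repl password=secret'
    PUBLICATION my_pub;
```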


17. HOT (Heap-Only Tuples) updates ? 


Let's understand the advantages and disadvantages of PostgreSQL's MVCC architecture : 

Advantages : 

  • no need for an extra storage area where old row versions are kept
  • ROLLBACK does not have to undo anything and is very fast
  • no overflow problem with transactions that modify many rows
Disadvantages : 

  • old, obsolete (“dead”) tuples have to be removed from the table eventually (VACUUM)
  • heavily updated tables can become “bloated” with dead tuples
  • every update requires new index entries to be added, even if no indexed attribute is modified, and modifying an index is much more expensive than modifying the table (order has to be maintained)

To allow for high concurrency, PostgreSQL uses multiversion concurrency control (MVCC) to store rows. However, MVCC has some downsides for update queries. Specifically, updates require new versions of rows to be added to tables. This can also require new index entries for each updated row, and removal of old versions of rows and their index entries can be expensive.

To help reduce the overhead of updates, PostgreSQL has an optimization called heap-only tuples (HOT). This optimization is possible when:

  • The update does not modify any columns referenced by the table's indexes, including expression and partial indexes.

  • There is sufficient free space on the page containing the old row for the updated row.

In such cases, heap-only tuples provide two optimizations:

  • New index entries are not needed to represent updated rows.

  • Old versions of updated rows can be completely removed during normal operation, including SELECTs, instead of requiring periodic vacuum operations. (This is possible because indexes do not reference their page item identifiers.)

In summary, heap-only tuple updates can only be created if columns used by indexes are not updated. You can increase the likelihood of sufficient page space for HOT updates by decreasing a table's fillfactor. If you don't, HOT updates will still happen because new rows will naturally migrate to new pages and existing pages with sufficient free space for new row versions. The system view pg_stat_all_tables allows monitoring of the occurrence of HOT and non-HOT updates.


A Heap Only Tuple is a tuple that is not referenced from outside the table block. Instead, a “forwarding address” (its line pointer number) is stored in the old row version.



A lower fillfactor can stimulate more HOT updates, i.e. fewer write operations. Since we write less, we also generate less WAL. In our case we saved 10% of overall WAL just by reducing the fillfactor on two tables. Another advantageous effect of HOT updates is that they also ease maintenance tasks on the table. After a HOT update, the old and new versions of the row are on the same page. This makes single-page cleanup more efficient and leaves the vacuum operation with less work to do.
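A sketch of lowering fillfactor and monitoring the effect (the table name and the value 80 are illustrative):

```sql
-- Leave 20% of each heap page free for same-page (HOT) row versions.
ALTER TABLE accounts SET (fillfactor = 80);
VACUUM FULL accounts;  -- rewrite existing pages to honor the new setting

-- Compare HOT updates against total updates.
SELECT relname, n_tup_upd, n_tup_hot_upd
FROM pg_stat_all_tables
WHERE relname = 'accounts';
```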


18. Difference between Aurora global database and cross-region read replicas ? 


Aurora Global database :  Physical replication at the underlying storage layer over a dedicated network link. Replication lag is minimal. 

Cross-region read replicas :  Use the engine's built-in replication technology (i.e. binlog for MySQL, and logical replication with replication slots for PostgreSQL).


19. Native logical replication vs pglogical.


Native logical replication : 
  • Doesn't support bi-directional replication.



Pglogical : 

ref : https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/
  • It does support bi-directional replication, providing two main features :
    • loop-back prevention.
    • conflict resolution.  
  • Limitations : 
    • To replicate updates and deletes on tables, pglogical requires a primary key or other valid replica identity such as a unique constraint. Without this unique identifier, pglogical can’t identify the rows to update or delete. Additionally, foreign key constraints aren’t enforced during the replication process, so any successful operation on the publisher is assumed to be acceptable for the subscriber as well. Since foreign key constraints help maintain the referential integrity of data between child and parent tables, it’s important to understand this limitation and ensure that your workload can’t cause any integrity violations.
    • Another important aspect to understand is the handling of sequences. Although pglogical provides a mechanism to sync up the sequence numbers, this doesn’t happen in real time. A better strategy is to have independent sequences on both nodes. For example, you may consider using only odd numbers on one node and only even numbers on the other node. This makes sure that there is no conflict of the sequence numbers.
    • UNLOGGED and TEMPORARY tables cannot be replicated.
    • DDL changes aren't automatically replicated between replicating nodes. While you may choose to run DDL changes on all replicating clusters individually, you can also use the pglogical function replicate_ddl_command to run the schema changes locally and replicate to the other node.  If transactions are in transit from one node to another, and at the same time the other node initiates the schema change, the in-transit data changes can’t write to the changed schema, and therefore the replication can break. The safe way to make schema changes to tables being changed on both nodes is to temporarily pause the write traffic.
    • Foreign key constraints are not enforced during replication - what succeeds on the provider side gets applied to the subscriber even if a FOREIGN KEY would be violated.
    • PGLogical does not support replication between databases with different encoding. We recommend using UTF-8 encoding in all replicated databases.
    • Currently, pglogical replication and administration require superuser privileges.

20. rules vs triggers.

A trigger is fired once for each affected row. A rule modifies the query or generates an additional query. So if many rows are affected by one statement, a rule issuing one extra command is likely to be faster than a trigger that is called for every single row and must re-determine what to do many times. However, the trigger approach is conceptually far simpler than the rule approach and is easier for novices to get right.
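A minimal contrast between the two mechanisms (the audit table and names are illustrative; PostgreSQL 11+ syntax):

```sql
-- Trigger: the function runs once per affected row.
CREATE FUNCTION log_update() RETURNS trigger AS $$
BEGIN
    INSERT INTO audit_log(tbl, changed_at) VALUES (TG_TABLE_NAME, now());
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_audit
    AFTER UPDATE ON accounts
    FOR EACH ROW EXECUTE FUNCTION log_update();

-- Rule: rewrites the statement, adding one extra command per statement.
CREATE RULE accounts_audit_rule AS ON UPDATE TO accounts
    DO ALSO INSERT INTO audit_log(tbl, changed_at) VALUES ('accounts', now());
```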























