1 Hari Babu Fujitsu Australia Software Technology In-memory columnar storage
2 About me I work for "Fujitsu Australia Software Technology Pvt Ltd". I am a member of the team that develops "FUJITSU Enterprise Postgres", which is based on PostgreSQL. I have more than 9 years of experience developing database products (both in-memory and disk-based).
3 Contents Introduction Need for columnar storage Need for in-memory Architecture Overview of the columnar storage approach Write optimized storage Read optimized storage Data movement from WOS to ROS SQL operations In-memory approach Performance comparison with other commercial databases Current state of the patch Further improvements
4 Introduction We at Fujitsu are working towards adding analytical and enterprise capabilities to PostgreSQL. The columnar storage is the first outcome of this goal of providing analytical capabilities. The Fujitsu Laboratories database team designed and developed the columnar storage.
5 Need for columnar storage During analytical query processing on large data sets, row-wise storage forces the database to scan and discard data from columns the query does not need, which hurts performance. Row-wise storage also increases disk I/O on large data sets, even when the row data is compressed. By storing data in columns rather than rows, the database can access precisely the data it needs to answer a query, and compression becomes more efficient: it is well known that a column of similar data, dates for example, compresses better than disparate data across rows.
6 Need for in-memory In-memory data access means querying data while it resides in the computer's random access memory, as opposed to querying data stored on physical disks. This results in faster query response times. Allowing transactional and analytical applications to access the same database simultaneously is another way to provide real-time analytics capabilities, but it can cause performance problems due to resource conflicts, largely caused by the latency of accessing data records stored on disk.
7 Architecture
8 Overview of the columnar storage approach Columnar storage is implemented as an extension, with minimal changes to the backend code. The extension must be loaded via shared_preload_libraries during server start. It adds a new index access method called VCI (Vertical Clustered Index). The user creates a VCI index on the columns that are to be part of the columnar storage.
9 Overview of the columnar storage approach To provide a proper columnar storage that gives a good performance improvement to read operations without affecting the performance of write operations, the storage is split into two types. WOS (write optimized storage): a temporary storage type that holds the modifications made by write operations. ROS (read optimized storage): the permanent storage type for the columns that are part of the columnar storage. To support this design, creating a VCI index internally creates several relations.
10 Write optimized storage The write optimized storage is where the data of all columns that are part of the columnar storage is stored temporarily, in row-wise format. Two relations make up the WOS. Data WOS: contains the tuple data of all columns that are part of the VCI. Whiteout WOS: contains the set of TIDs of tuples that have been marked for deletion. The WOS is responsible for MVCC behaviour in this design; we adopted this approach in order to keep the data structure of ROS simple.
11 Write optimized storage The following diagrams explain the data page layout of the Data WOS and Whiteout WOS relations. All newly added/deleted data is stored in the WOS relations together with xmin/xmax information. If the user updates or deletes newly added data, it does not affect performance much, compared to deleting the data directly from the columnar storage.
12 Write optimized storage Tuples that no longer have multiple visible copies, i.e. whose data is frozen, are moved from WOS to ROS periodically by a background worker called the WOS-to-ROS converter process. Each column's data is stored separately in its own relation file. No transaction information is present in ROS; the data in ROS is referenced by tuple ID.
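The conversion rule above can be sketched as a simple visibility filter. This is a minimal illustration, not the actual converter: the field names (xmin, xmax) mirror PostgreSQL's tuple header, but the dictionary representation and the comparison against the oldest active transaction ID are simplifying assumptions.

```python
def select_convertible(wos_rows, oldest_active_xid):
    """Pick WOS tuples that are safe to move to ROS: inserted by a
    transaction older than every active one (so all transactions can
    see them) and not deleted (no xmax set)."""
    return [r for r in wos_rows
            if r["xmin"] < oldest_active_xid and r["xmax"] is None]


rows = [
    {"xmin": 5, "xmax": None},   # frozen, live -> convertible
    {"xmin": 12, "xmax": None},  # still visible only to newer snapshots
    {"xmin": 3, "xmax": 9},      # deleted -> stays in WOS / whiteout
]
movable = select_convertible(rows, oldest_active_xid=10)
```

Tuples that fail the filter stay in the WOS, where normal MVCC rules continue to apply until a later conversion pass picks them up.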
13 Read Optimized Storage Read optimized storage is where the actual column relation data lives. The following slides discuss the details that matter for ROS: an overview of ROS, the ROS extent, ROS column data access, and the ROS delete vector.
14 Read Optimized Storage Read optimized storage is where each column's data is stored individually in a separate data file. Unlike WOS, ROS data is not managed by the TID used in the original table. When data is transferred from WOS to ROS, an internal ID called the Columnar Record Identifier (CRID) is assigned to each set of columnar data (one datum per column). A CRID is the combination of an extent number and the offset of the record within that extent.
15 Read Optimized Storage The CRID gives the logical position of the columnar data, and is generated in increasing order of record registration. Columnar data is stored in column data relations, one per column, and the CRID is used to find the position of the data that composes a record in each column data relation. The extent is introduced as the unit of data management in ROS: one extent holds a fixed number of consecutive CRIDs. When a large number of records is transferred to ROS during WOS→ROS conversion, the storing is divided and executed in units of extents.
16 Read Optimized Storage ROS extent: An extent is a logical data block in ROS. One extent contains 262,144 records. So that a position can be obtained from a CRID with only a couple of operations, the number of records in an extent is fixed, including unused CRIDs. Since column-element sizes differ from each other, the sizes of extents differ, as shown in the figure.
17 Read Optimized Storage The position of an extent in the relation, the compression algorithm for ROS data, and the dictionaries used for compression are all specified per extent. The WOS-to-ROS conversion is also performed per extent, and garbage collection of deleted rows is performed for each extent. The unused CRIDs gathered at the end of an extent are reused when new rows are appended to the extent. An extent whose columns have fixed column-element lengths occupies a fixed number of DB pages; therefore, a fixed number of contiguous DB pages is assigned to such an extent, to prevent fragmentation.
18 Read Optimized Storage The following figures explain the mapping of an extent to DB pages and the information stored in an extent.
19 Read Optimized Storage In PostgreSQL, very long variable-length data is TOASTed externally. In order to reduce the data size in columns, the TOAST link information is stored in the data area of the extent, and the highest two bits of the column-element length field indicate whether the datum is a link or a raw datum, as shown in the figure and table below: 00 = raw data, 10 = external TOAST.
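The two-bit kind flag can be pictured as tag bits packed into the length word. This is a hypothetical sketch: the slides do not state the width of the length field, so a 32-bit word with the flag in bits 31–30 is an assumption made here for illustration.

```python
# Hypothetical 32-bit length word: the top two bits carry the kind flag
# (00 = raw datum, 10 = external TOAST link), the rest holds the length.
FLAG_SHIFT = 30
RAW, EXTERNAL_TOAST = 0b00, 0b10
LEN_MASK = (1 << FLAG_SHIFT) - 1


def pack_length(length, kind):
    """Combine a column-element length with its two-bit kind flag."""
    assert 0 <= length <= LEN_MASK
    return (kind << FLAG_SHIFT) | length


def unpack_length(word):
    """Recover (kind flag, length) from a packed length word."""
    return word >> FLAG_SHIFT, word & LEN_MASK
```

Reserving the flag in the length field avoids a separate per-element marker, which is the size saving the slide describes.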
20 Read Optimized Storage ROS column data access: The following figure explains how a column data relation is reached from the VCI main relation.
21 Read Optimized Storage For fixed-length column datatypes, both compressed and raw, the column size is known, and the position of a datum is calculated directly from the CRID: the extent ID is obtained by dividing the CRID by the number of rows in an extent, and the remainder is the position within the extent. For variable-length columns, the offset of each column-element's data is recorded in the CRID offset, through which the data can be referred to.
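The fixed-length lookup described above is just a division and a multiplication. A minimal sketch, using the 262,144-records-per-extent constant from the earlier slide (the function names are illustrative, not from the patch):

```python
RECORDS_PER_EXTENT = 262_144  # fixed record count per extent


def crid_to_position(crid):
    """Split a CRID into (extent number, offset inside the extent)."""
    return divmod(crid, RECORDS_PER_EXTENT)


def fixed_width_offset(crid, elem_size):
    """Locate a fixed-width datum: the extent holding it, and the byte
    offset of the datum inside that extent's data area."""
    extent, slot = crid_to_position(crid)
    return extent, slot * elem_size
```

Because the per-extent record count is fixed (unused CRIDs included), no lookup table is needed for fixed-width columns; only variable-length columns need the recorded per-element offsets.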
22 Read Optimized Storage ROS delete vector: The delete vector is a bit vector representing whether each row has been deleted from ROS. The delete vector itself is a fixed-length column datum, and the bits for eight consecutive rows are packed into one byte in ROS. When processing queries, only rows without deletion marks are used. Periodically, deleted rows are collected by copying garbage collection: the live rows are copied from the beginning of an empty extent, which as a result gathers the consecutive unused CRIDs together at the end of the extent.
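The eight-rows-per-byte packing can be sketched with ordinary bit operations. This is an illustration only; in particular, the bit order within each byte (least-significant bit = first row here) is an assumption, as the slides do not specify it.

```python
def mark_deleted(vec, row):
    """Set the deletion bit for a row in a packed delete vector.
    Bit order within the byte is assumed LSB-first."""
    vec[row // 8] |= 1 << (row % 8)


def is_deleted(vec, row):
    """Test the deletion bit for a row."""
    return bool(vec[row // 8] & (1 << (row % 8)))


# One byte covers eight rows, so 16 rows fit in two bytes.
vec = bytearray(2)
mark_deleted(vec, 10)
```

A scan then skips any row whose bit is set, which is the "only rows without deletion marks are used" behaviour from the slide.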
23 Read Optimized Storage The unused CRIDs are reused for future WOS-to-ROS transfers. The reason for using copying garbage collection is to continue processing queries during garbage collection. Copying garbage collection copies only the live data in an extent into an unused contiguous area inside the relation. Queries issued before the copy access data in the original source extent area, while queries issued after the copy access data in the copied destination extent area. As soon as no queries are left that access data in the source extent area, the source extent area is reclaimed. Thus queries can access data without stopping, even during garbage collection.
24 Read Optimized Storage The following diagram explains how the garbage collection works.
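The copy phase of this garbage collection can be sketched in a few lines: live rows are compacted to the front of a fresh extent, and the slots freed by deleted rows become consecutive unused CRIDs at the end. This sketch ignores the concurrency aspect (old queries reading the source extent while new queries read the destination), which is the part that lets queries run without stopping.

```python
def copy_live_rows(src_extent, delete_vec):
    """Copy only live rows from a source extent into a new destination
    extent.  Returns the compacted rows and the number of reusable
    slots left at the end of the destination extent."""
    dst = [row for row, deleted in zip(src_extent, delete_vec)
           if not deleted]
    free_slots = len(src_extent) - len(dst)
    return dst, free_slots


compacted, reusable = copy_live_rows(
    ["a", "b", "c", "d"], [False, True, False, True])
```

Once no query still references the source extent, its pages can be reclaimed, and the reusable slots at the end of the destination are handed out to future WOS-to-ROS transfers.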
25 Data movement from WOS to ROS A background worker process called the ROS daemon, similar to the autovacuum launcher, launches worker processes that do the following: 1. WOS-to-ROS conversion 2. Update the delete vector (Whiteout WOS to delete vector conversion) 3. Collect deleted rows in an extent 4. Update the TID-CRID relation with the TID-CRID update list 5. Collect unused extents
26 SQL operations All INSERT/DELETE operations take place directly on the WOS relations, and the changes are periodically transferred to ROS.
27 SQL Operations
28 SQL Operations During each query execution, the Data WOS and Whiteout WOS corresponding to the columnar storage table are converted into a Local ROS. The life of a Local ROS ends with the query execution. Local ROS extents are numbered downward starting from -1, so they do not collide with regular extent numbers. All the data visible in the WOS and Whiteout WOS is transformed into the Local ROS at the start of query execution.
29 In-memory approach As discussed under the need for in-memory, for columnar storage to work well we need to ensure that most of its data resides in shared buffers alongside OLTP operations. Instead of a separate in-memory mechanism, we can try to reuse the shared buffers logic. To achieve this, we can do something like the following: reserve some shared buffer space for columnar storage tables, and recycle the shared buffer pages used by columnar storage tables only when their usage crosses the specified reserve ratio. This ensures that most of the columnar table pages reside in shared buffers.
30 In-memory approach A new GUC configuration parameter specifies the reserve ratio for columnar storage tables: reserve_buffer_ratio (0 - 75). Alternatively, a separate shared buffer pool could be created for columnar storage tables, similar to Oracle's multiple buffer pools, but this needs proper design changes. A new reloption can be added to mark a table as needing the stable buffer option.
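The proposed eviction rule can be summarised in a tiny policy class. This is a sketch of the idea only, not PostgreSQL buffer manager code; the class and method names are invented for illustration, and only the 0-75 range of the proposed GUC comes from the slides.

```python
class ReservedBufferPolicy:
    """Sketch of the proposed policy: pages of columnar-storage tables
    are recycled only once those tables exceed their reserved share of
    shared buffers."""

    def __init__(self, total_buffers, reserve_ratio):
        assert 0 <= reserve_ratio <= 75  # range of the proposed GUC
        self.reserved = total_buffers * reserve_ratio // 100
        self.columnar_in_use = 0  # columnar pages currently buffered

    def may_evict_columnar(self):
        # Below the reserved share, columnar pages are pinned in place;
        # above it, they become candidates for normal recycling.
        return self.columnar_in_use > self.reserved


policy = ReservedBufferPolicy(total_buffers=1000, reserve_ratio=50)
policy.columnar_in_use = 400   # under the 500-buffer reservation
```

Regular heap pages would still be evicted by the usual clock-sweep; only columnar pages get this protected floor, which is what keeps the analytical working set memory-resident.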
31 Performance comparison with other commercial databases [Table: execution time in milliseconds for queries Q1 - Q22 (Q15 omitted), comparing PostgreSQL with VCI off, VCI (column store, 1 CPU and 40 CPUs), RDBMS1, RDBMS2, and RDBMS3 (row store and in-memory option, 1 CPU and 40 CPUs); the individual figures were garbled in extraction.]
32 Performance comparison with other commercial databases [Table: speed improvement rates relative to PostgreSQL for the same queries, for VCI (1 CPU and 40 CPUs), RDBMS1, RDBMS2, and RDBMS3 (row and in-memory variants), with geometric averages overall and excluding the cases slower than PostgreSQL; entries slower than PostgreSQL are marked. The individual figures were garbled in extraction.]
33 Current state of the patch The IMCS patch is implemented as a new index access method called the vertical clustered index. All of the storage code is complete. Currently, with the help of a custom plan, plans are generated to use the VCI index. Some workarounds were added to deal with HOT update and vacuum operations, since IMCS is currently treated as an index.
34 Further improvements Integrate the storage changes with the proposed new columnar storage CREATE syntax. Add planner changes to take advantage of the columnar storage, instead of using custom plan methods to generate the plan. Remove all the workarounds that were added for HOT update.
35 Questions?
36 Copyright 2014 FUJITSU LIMITED