1. What’s a typical scenario for using Apache Kylin?
Ans: Kylin is a good option when you have a huge table (e.g., more than 100 million rows) joined with lookup tables, queries must finish within seconds (dashboards, interactive reports, business intelligence, etc.), and there are dozens or hundreds of concurrent users.
2. How large a data scale can Kylin support? How about the performance?
Ans: Kylin supports second-level query performance on TB- to PB-scale datasets. This has been verified by users such as eBay, Meituan, and Toutiao. Take Meituan’s case as an example (as of 2018-08): 973 cubes, 3.8 million queries per day, raw data of 8.9 trillion records, a total cube size of 971 TB (the original data is bigger), 50% of queries finished in under 0.5 seconds, and 90% of queries in under 1.2 seconds.
3. What’s the expansion rate of a cube (compared with the raw data)?
Ans: It depends on several factors, such as the number of dimensions and measures, dimension cardinality, the number of cuboids, and the compression algorithm. You can optimize the cube in many ways to control its size.
4. How does Kylin compare with other SQL engines like Hive, Presto, Spark SQL, and Impala?
Ans: They answer a query in different ways. Kylin is not a replacement for them, but a supplement (a query accelerator). Many users run Kylin together with other SQL engines. For frequently used query patterns, building cubes can greatly improve performance and also offload work from the cluster. For rarely used patterns or ad-hoc queries, the MPP engines are more flexible.
5. How does Kylin compare with Druid?
Ans: Druid is more suitable for real-time analysis, while Kylin is more focused on the OLAP case. Druid integrates well with Kafka for real-time streaming; Kylin fetches data from Hive or Kafka in batches. The real-time capability of Kylin is still under development.
Many internet service providers host both Druid and Kylin, serving different purposes (real-time and historical).
Some other Kylin highlights: star and snowflake schema support, ANSI SQL support, and JDBC/ODBC for BI integration. Kylin also has a web GUI with LDAP/SSO user authentication.
6. How do I get started quickly with Kylin?
Ans: For a quick start, you can run Kylin in a Hadoop sandbox VM or in the cloud; for example, start a small AWS EMR or Azure HDInsight cluster and then install Kylin on one of the nodes.
7. How many Hadoop nodes are needed to run Kylin?
Ans: Kylin can run on a Hadoop cluster of anywhere from a couple of nodes to thousands of nodes, depending on how much data you have. The architecture is horizontally scalable.
Because most of the computation happens in Hadoop (MapReduce/Spark/HBase), you usually only need to install Kylin on a couple of nodes.
8. How many dimensions can be in a cube?
Ans: The maximum number of physical dimensions in a cube (excluding derived columns in lookup tables) is 63. If you can normalize some dimensions into lookup tables and use derived dimensions, you can create a cube with more than 100 dimensions.
But a cube with more than 30 physical dimensions is not recommended; you may not even be able to save it in Kylin unless you optimize the aggregation groups. Please search for the “curse of dimensionality”.
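As a rough illustration of why the dimension count matters: without any pruning, a cube with N dimensions can have up to 2^N cuboids, one per combination of dimensions. The sketch below only computes that upper bound; the function name is illustrative, and real cubes are smaller once aggregation groups and other optimizations are applied.

```python
def max_cuboids(num_dimensions: int) -> int:
    """Upper bound on the number of cuboids: one per subset of dimensions."""
    if num_dimensions < 0:
        raise ValueError("num_dimensions must be non-negative")
    return 2 ** num_dimensions


# Print the bound for a few dimension counts, including the 63-dimension limit.
for n in (10, 20, 30, 63):
    print(f"{n} dimensions -> up to {max_cuboids(n):,} cuboids")
```

Going from 20 to 30 dimensions multiplies the bound by 1,024, which is why trimming dimensions and tuning aggregation groups pays off quickly.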
9. Why do I get an error when running a “select *” query?
Ans: The cube only holds aggregated data, so all your queries should be aggregate queries (with “GROUP BY”). You can group by all dimensions to get a result as close as possible to the detailed rows, but that is still not the raw data.
To stay compatible with some BI tools, Kylin tries to answer “select *” queries, but please be aware that the result might not be what you expect. Make sure each query sent to Kylin is aggregated.
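A small sketch of the difference, using an in-memory SQLite table as a stand-in for the source data (this is not Kylin itself, and the table and column names are made up for illustration). Grouping by every dimension merges rows that share the same dimension values, so the result is close to the raw rows but not identical to them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "tv", 100), ("east", "tv", 50), ("west", "radio", 30)],
)

# Raw rows: three records.
raw_rows = conn.execute("SELECT region, product, amount FROM sales").fetchall()

# Aggregate query grouped by all dimensions: the two east/tv rows collapse.
grouped_rows = conn.execute(
    "SELECT region, product, SUM(amount), COUNT(*) "
    "FROM sales GROUP BY region, product ORDER BY region, product"
).fetchall()

print(len(raw_rows), "raw rows")
print(grouped_rows)
```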
10. How can I query raw data from a cube?
Ans: A cube is not the right option for raw data.
But if you really need it, there are some workarounds: 1) add the primary key as a dimension, so that “group by pk” returns the raw data; 2) configure Kylin to push the query down to another SQL engine such as Hive, although performance is then not guaranteed.
11. What is the UHC dimension?
Ans: UHC means Ultra High Cardinality. Cardinality is the number of distinct values of a dimension. Usually, a dimension’s cardinality ranges from tens to millions. If it is above one million, we call it a UHC dimension; examples are user ID and cell phone number.
Kylin supports UHC dimensions, but you need to handle them carefully, especially the encoding and the cuboid combinations. They can make your cube very large and your queries slow.
12. Can I specify a cube to answer my SQL statements?
Ans: No, you can’t; cube selection is transparent to the end user. If you have multiple cubes for the same data model, separating them into different projects is a good idea.
13. Is there a REST API to create the project/model/cube?
Ans: Yes, but they are private APIs and are liable to change between versions without notice. By design, Kylin expects users to create new projects, models, and cubes in Kylin’s web GUI.
14. How to define a snowflake model (with two fact tables)?
A snowflake model still has only one fact table, but you can define joins from one lookup table to another lookup table.
If the query pattern between your two “fact” tables is fixed, for example fact A left join fact B, you can define fact B as a lookup table and skip the snapshot for this huge lookup table.
15. Where is the cube stored? Can I read the cube directly from HBase without going through the Kylin API?
Ans: The cube is stored in HBase. Each cube segment is an HBase table. The dimension values are composed into the row key, and the measures are serialized in columns. To improve storage efficiency, both dimension and measure values are encoded to bytes. Kylin decodes the bytes back to the original values after fetching them from HBase. Without Kylin’s metadata, the HBase tables are not readable.
16. How to encrypt cube data?
You can enable encryption on the HBase side. Refer to https://hbase.apache.org/book.html#hbase.encryption.server for more details.
17. How to schedule the cube build at a fixed frequency, in an automatic way?
Kylin doesn’t have a built-in scheduler for this. You can trigger builds through the REST API from an external scheduler such as a Linux cron job or Apache Airflow.
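A minimal sketch of what such a scheduled job might send, assuming the cube build endpoint and request body described in Kylin’s REST API documentation (PUT /kylin/api/cubes/{cubeName}/build with startTime, endTime, and buildType) and HTTP Basic authentication. The host, port, cube name, and time range are placeholders, and the endpoint should be checked against the documentation for your Kylin version. The code only constructs the request; nothing is sent.

```python
import base64
import json
import urllib.request


def make_build_request(base_url, cube_name, start_ms, end_ms, user, password):
    """Build (but do not send) a cube build request for a scheduler to submit."""
    body = json.dumps(
        {"startTime": start_ms, "endTime": end_ms, "buildType": "BUILD"}
    ).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/cubes/{cube_name}/build",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values: a hypothetical cube and a one-day range in epoch milliseconds.
req = make_build_request(
    "http://localhost:7070", "my_cube", 0, 86_400_000, "ADMIN", "KYLIN"
)
print(req.get_method(), req.full_url)
# A cron job or Airflow task would then call urllib.request.urlopen(req).
```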
18. How to view a Kylin cube’s HBase table without encoding?
To view the original data, please query Kylin with SQL. Kylin converts the SQL query into HBase access and then decodes the data. You can connect to Kylin through the REST API or the JDBC and ODBC drivers.
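For the REST route, a query is a JSON POST. The sketch below assumes the query endpoint and body fields from Kylin’s REST API documentation (POST /kylin/api/query with sql, project, offset, and limit); the host, project, and table names are placeholders. It only constructs the request, and a real call also needs an authentication header.

```python
import json
import urllib.request


def make_query_request(base_url, project, sql, limit=100):
    """Build (but do not send) a SQL query request for Kylin's REST API."""
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical project and table; the query is aggregated, as Kylin expects.
query_req = make_query_request(
    "http://localhost:7070",
    "my_project",
    "select part_dt, sum(price) from my_fact_table group by part_dt",
)
print(query_req.get_method(), query_req.full_url)
```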
19. Does Kylin support Hadoop 3 and HBase 2.0?
Starting from v2.5.0, Kylin provides a binary package for Hadoop 3 and HBase 2.
20. The cube is ready, but why doesn’t the table appear in the “Insight” tab?
Make sure the “kylin.server.cluster-servers” property in conf/kylin.properties lists EVERY Kylin node, including all job and query nodes. Kylin nodes use this configuration to notify each other to flush their caches. Also make sure the network between them is healthy.
21. What should I do if I encounter a “java.lang.NoClassDefFoundError” error?
Kylin doesn’t ship the Hadoop jars, because they should already exist on the Hadoop node. Instead, Kylin tries to find them and add them to its classpath. Because Hadoop environments are complex, in some cases a jar may not be found. If that happens, look at “bin/find-*-dependency.sh” and “bin/kylin.sh” and modify them to fit your environment.
22. How to add dimension/measure to a cube?
Once a cube is built, its structure can’t be modified. To add a dimension or measure, clone the cube and add it to the clone.
When the new cube is built, please disable or drop the old one.
If you can accept that the new dimensions are absent from historical data, you can build the new cube starting from the end time of the old cube, and then create a hybrid model over the old and new cubes.
23. How to solve the data security problem of the Tableau connection client?
Kylin’s ACL control can solve this problem. Different analysts are authorized to work on different Kylin projects. When you create Kylin ODBC DSNs, you can map different connections to different analyst accounts.
24. The query result doesn’t exactly match the result in Hive. What are the possible reasons?
Possible reasons:
a) Source data changed in Hive after being built into the cube;
b) Cube’s time range is not the same as in Hive;
c) Another cube answered your query;
d) The data model has inner joins, but the query doesn’t join all tables;
e) The cube has some approximate measures like HyperLogLog or TopN;
f) In v2.3 and before, Kylin may have data loss when fetching from Hive, see KYLIN-3388.
25. What to do if the source data changed after being built into the cube?
You need to refresh the cube. If the cube is partitioned, you can refresh certain segments.
26. What is the possible reason for getting the error ‘bulk load aborted with some files not yet loaded’ in the ‘Load HFile to HBase Table’ step?
Kylin doesn’t have permission to execute HBase CompleteBulkLoad. Check whether the user that runs the Kylin service has permission to access HBase.
27. Why can’t bin/sample.sh create the /tmp/kylin folder on HDFS?
Run ./bin/find-hadoop-conf-dir.sh -v, check the error message, then check the environment according to the information reported.
28. In Chrome, the web console shows net::ERR_CONTENT_DECODING_FAILED. What should I do?
Edit $KYLIN_HOME/tomcat/conf/server.xml, find the compression="on" setting, and change it to off.
29. How to configure one cube to be built using a chosen YARN queue?
Set the YARN queue on the cube’s Configuration Overwrites page; it then affects only that cube. Here are the three parameters:
kylin.engine.mr.config-override.mapreduce.job.queuename=YOUR_QUEUE_NAME
kylin.source.hive.config-override.mapreduce.job.queuename=YOUR_QUEUE_NAME
kylin.engine.spark-conf.spark.yarn.queue=YOUR_QUEUE_NAME
30. How to add a new JDBC data source dialect?
It is easy to add a new type of JDBC data source. Follow these steps:
1) Add the dialect in source-hive/src/main/java/org/apache/kylin/source/jdbc/JdbcDialect.java
2) Implement a new IJdbcMetadata if metadata fetching for {database that you want to add} differs from the others, and then register it in JdbcMetadataFactory
3) You may need to customize the SQL for creating/dropping tables in JdbcExplorer for {database that you want to add}.
31. How to ask a question?
Check Kylin’s documentation first; a Google search can also help. Sometimes the question has already been answered, so you don’t need to ask again. If you find no match, please send your question to the Apache Kylin user mailing list: user@kylin.apache.org. If you haven’t subscribed yet, send an email to user-subscribe@kylin.apache.org to subscribe first. In your email, please provide your Kylin and Hadoop versions, the specific error logs (as much as possible), and the steps to reproduce the problem.
“bin/find-hive-dependency.sh” can locate the Hive/HCatalog jars locally, but Kylin reports an error like “java.lang.NoClassDefFoundError: org/apache/hive/hcatalog/mapreduce/HCatInputFormat” or “java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/session/SessionState”
Kylin needs many dependent jars (Hadoop/Hive/HCatalog/HBase/Kafka) on the classpath to work, but it doesn’t ship them. It looks for these jars on your local machine by running commands like hbase classpath and hive -e set. The paths of the jars it finds are appended to the environment variable HBASE_CLASSPATH (Kylin uses the hbase shell command to start up, which reads this variable). But in some Hadoop distributions (like AWS EMR 5.0), the hbase shell script doesn’t keep the original HBASE_CLASSPATH value, which causes the “NoClassDefFoundError”.
To fix this, find the hbase shell script (in the hbase/bin folder), search for HBASE_CLASSPATH, and check whether it overwrites the value, like:
export HBASE_CLASSPATH=$HADOOP_CONF:$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$ZOOKEEPER_HOME/*:$ZOOKEEPER_HOME/lib/*
If so, change it to keep the original value, like:
export HBASE_CLASSPATH=$HADOOP_CONF:$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$ZOOKEEPER_HOME/*:$ZOOKEEPER_HOME/lib/*:$HBASE_CLASSPATH
Why do I get “java.lang.IllegalArgumentException: Too high cardinality is not suitable for dictionary – cardinality: 5220674” in the “Build Dimension Dictionary” step?
Kylin uses “Dictionary” encoding to encode and decode dimension values. A dimension’s cardinality is usually below the millions, so “Dict” encoding works well. Because the dictionary must be persisted and loaded into memory, a very high cardinality leads to a tremendous memory footprint, so Kylin adds a check for this. If you see this error, first identify the UHC dimension and then re-evaluate the design (does it really need to be a dimension?). If you must keep it, you can bypass the error in two ways: 1) switch to another encoding (like fixed_length or integer), or 2) set a bigger value for kylin.dictionary.max.cardinality in conf/kylin.properties.
32. Why does Kylin need to extract the distinct column values from the fact table before building the cube?
Kylin uses a dictionary to encode the values in each column, which greatly reduces the cube’s storage size. To build the dictionary, Kylin needs to fetch the distinct values of each column.
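A minimal sketch of the idea, using a plain Python dict rather than Kylin’s actual dictionary implementation (the column values are made up). Each distinct value gets a small integer ID, and rows then store the ID instead of the full value, so the mapping cannot be built without first collecting the distinct values:

```python
def build_dictionary(values):
    """Assign a compact integer ID to each distinct value, in sorted order."""
    distinct = sorted(set(values))
    return {value: idx for idx, value in enumerate(distinct)}


column = ["beijing", "shanghai", "beijing", "shenzhen", "shanghai"]
dictionary = build_dictionary(column)

# Encode the column with IDs, then decode it back to check the round trip.
encoded = [dictionary[v] for v in column]
reverse = {idx: value for value, idx in dictionary.items()}
decoded = [reverse[i] for i in encoded]

print(dictionary)
print(encoded)
```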
33. Why does Kylin calculate the Hive table cardinality?
The cardinality of dimensions is an important measure of cube complexity. The higher the cardinality, the bigger the cube, and thus the longer it takes to build and the slower it is to query. A cardinality above 1,000 is worth attention, and a cardinality above 1,000,000 should be avoided whenever possible. For optimal cube performance, try to reduce high cardinality by categorizing values or deriving features.
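The two thresholds above can be expressed as a simple check when reviewing a model. This is only an illustrative helper (the function name, labels, and sample cardinalities are made up), not something Kylin provides:

```python
def cardinality_advice(cardinality: int) -> str:
    """Classify a dimension by the rule-of-thumb thresholds of 1,000 and 1,000,000."""
    if cardinality > 1_000_000:
        return "avoid if possible"
    if cardinality > 1_000:
        return "worth attention"
    return "ok"


# Hypothetical dimensions with their distinct-value counts.
for name, card in [("country", 200), ("city", 30_000), ("user_id", 5_220_674)]:
    print(f"{name}: {card:,} distinct values -> {cardinality_advice(card)}")
```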
The password hashes for the pre-defined test users can be found in the “sandbox, testing” profile section. To change a default password, generate a new hash and update it there; for a code snippet, please refer to https://stackoverflow.com/questions/25844419/spring-bcryptpasswordencoder-generate-different-password-for-same-input
When you deploy Kylin for more users, switching to LDAP authentication is recommended.
34. How to update the default password for ‘ADMIN’?
By default, Kylin uses a simple, configuration-based user registry; the default administrator ‘ADMIN’ with password ‘KYLIN’ is hard-coded in kylinSecurity.xml. To change the password, first generate the new password’s BCrypt hash and then set it in kylinSecurity.xml. Here is a sample with the password ‘ABCDE’:
cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib
java -classpath kylin-server-base-2.3.0.jar:spring-beans-4.3.10.RELEASE.jar:spring-core-4.3.10.RELEASE.jar:spring-security-core-4.2.3.RELEASE.jar:commons-codec-1.7.jar:commons-logging-1.1.3.jar org.apache.kylin.rest.security.PasswordPlaceholderConfigurer BCrypt ABCDE
BCrypt encrypted password is:
$2a$10$A7.J.GIEOQknHmJhEeXUdOnj2wrdG4jhopBgqShTgDkJDMoKxYHVu
Then set it in kylinSecurity.xml:
vi ./tomcat/webapps/kylin/WEB-INF/classes/kylinSecurity.xml
Replace the original encrypted password with the new one:
<bean class="org.springframework.security.core.userdetails.User" id="adminUser">
<constructor-arg value="ADMIN"/>
<constructor-arg
value="$2a$10$A7.J.GIEOQknHmJhEeXUdOnj2wrdG4jhopBgqShTgDkJDMoKxYHVu"/>
<constructor-arg ref="adminAuthorities"/>
</bean>
Restart Kylin for the change to take effect. If you run multiple Kylin servers as a cluster, do the same on each instance.
35. What kind of data is left in ‘kylin.env.hdfs-working-dir’? We often run the Kylin storage cleanup command, but our working directory is now about 300 GB in size. Can we delete old data manually?
The data in ‘hdfs-working-dir’ (‘hdfs:///kylin/kylin_metadata/’ by default) includes intermediate files (which will be garbage-collected) and cuboid data (which won’t be). The cuboid data is kept for later segment merges, because Kylin can’t merge from HBase. If you’re sure those segments won’t be merged, you can move them to other paths or even delete them.
Pay attention to the “resources” or “jdbc-resources” sub-folder under ‘/kylin/kylin_metadata/’, which persists big metadata files such as dictionaries and lookup table snapshots. These shouldn’t be moved manually.
36. How to escape keywords in fuzzy match (like) queries?
“%” and “_” are keywords in the “like” clause: “%” matches any sequence of characters, and “_” matches a single character. To match a keyword such as “_” literally, escape it with a preceding escape character. Below is a sample that uses “/” as the escape character; the query matches usernames containing “xiao_”:
select username from gg_user where username like '%xiao/_%' escape '/';
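The effect of the escape clause can be tried out locally. The sketch below uses an in-memory SQLite database as a stand-in (not Kylin), with made-up usernames in a gg_user table; SQLite follows the same LIKE ... ESCAPE syntax. Without the escape, the underscore acts as a single-character wildcard; with it, only a literal underscore matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gg_user (username TEXT)")
conn.executemany(
    "INSERT INTO gg_user VALUES (?)",
    [("xiao_ming",), ("xiaoming",), ("xiaoxming",), ("lihua",)],
)

# Unescaped: "_" matches any single character after "xiao".
unescaped = [
    row[0]
    for row in conn.execute(
        "SELECT username FROM gg_user "
        "WHERE username LIKE '%xiao_%' ORDER BY username"
    )
]

# Escaped: "/_" matches only a literal underscore.
escaped = [
    row[0]
    for row in conn.execute(
        "SELECT username FROM gg_user "
        "WHERE username LIKE '%xiao/_%' ESCAPE '/' ORDER BY username"
    )
]

print(unescaped)
print(escaped)
```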
37. What are the benefits of using Apache Kylin as a data source?
Easy Access to Massive Datasets: Interactively work with large amounts (TB/PB) of data.
Blazing Fast Performance: Get sub-second response times to your queries on Big Data.
High Scalability: With Kylin’s linear scalability, scale up your data without worrying about performance.
Web-Scale Concurrency: Deploy to thousands of concurrent users.
Minimal Data Engineering: Invest time in discovering insights and leave the data engineering to Apache Kylin.
38. What is the support for MDX?
MDX queries are not supported at present; the query entry point is SQL. For MDX-based tools like Saiku, the community has contributed a Mondrian jar package. It converts the MDX issued by the Saiku front end into SQL and then sends it to the Kylin server through the JDBC jar. However, the functionality is limited: support for left join, topN, and count distinct is limited.
39. For TB-level data, how long does it take Kylin to build a cube every day?
The cube build time depends on many factors: the number of dimensions and their combinations, cardinality, source data size, how well the cube is optimized, cluster computing power, and so on. In some cases, it takes only a few tens of minutes to build tens of gigabytes of data on a shared cluster. It is recommended that you test in your actual environment to find where the cube can be optimized. Also, in general, incremental cube builds can be triggered automatically by the system after the ETL completes. Often, this build time does not coincide with the analysts’ peak analysis time.
40. How to submit code to Kylin?
Turn your changes into a patch file with git format-patch and attach it to the corresponding JIRA issue. A Kylin committer will review it and, if there is no problem, merge it into the development branch.
41. If the data is in Elasticsearch, what support does Kylin offer?
At present, extracting data directly from Elasticsearch is not supported. The data needs to be exported to Hive and then built into a cube. Those interested can implement an Elasticsearch data source based on Kylin 1.5’s plugin architecture.
42. Which front-end drag-and-drop tool works best?
At present, Tableau is well supported. Saiku support is weaker: some scenarios such as left join, count distinct, and topN are not well supported. Users can also develop their own drag-and-drop pages based on the API.
43. What is the difference between the community and commercial versions?
The commercial version provides greater security, stability, reliability, and good integration of enterprise components; as well as reliable, professional, source-level commercial support.
44. How is the performance under many concurrent queries?
Compared with MPP-architecture technologies, Kylin has a big advantage in high concurrency. A single Kylin query server supports tens to hundreds of QPS (depending on query complexity, machine configuration, etc.), and Kylin scales out well horizontally: adding Kylin servers and HBase nodes quickly increases the supported concurrency.
45. Can Kylin integrate with Spark machine learning and Spark SQL?
Based on the pluggable architecture mentioned above, it can be integrated.
46. When Kylin is compared with other tools, is the time to build the cube left out? Other engines compute in real time while Kylin pre-computes, so the mechanisms are not the same.
When Kylin’s query performance is compared with MPP-architecture technologies, the time spent building the cube is not counted, so in a sense the comparison is somewhat unfair. However, from the user’s perspective, analysts and end users only care about query performance, and Kylin’s pre-computation can significantly improve query speed, which is what users need.
47. Is there sample code for the Kylin ODBC driver?
The current code is on the master branch. You are welcome to join the community to contribute.
48. 400 million rows is rather small. Has Kylin run any relevant benchmark? How does it perform with tens of billions of rows and ten dimensions?
According to test data from the community, on a cube (26 TB) built from nearly 28 billion raw records, 90% of queries completed within 5 seconds.
49. If the amount of data doubles, will the space used grow exponentially?
Usually, the cube grows in proportion to the original data: if the original data doubles, the cube doubles or grows by less, rather than growing exponentially.