# Advanced Analytics

The advanced analytics features of the OpenText™ Analytics suite include visualizations and capabilities not typically seen in modern BI products, such as Venn Diagrams, Evolution charts and Profiling / Z-Score. Even if you're not a statistician, you can easily take your data analysis to the next level with pre-built predictive algorithms such as Forecasting, Decision Tree, Clustering, Association Rules, Logistic Regressions, Linear Regressions, Correlations and Naive Bayes.

### Scrutinize big volumes of data and build predictive models in seconds

Visual Data Mining and Predictive Analysis typically require expertise in specific programming languages and an immense amount of time to code and test. In contrast, the Big Data Analytics feature provides the most commonly used data mining algorithms pre-built and ready for use, with no coding or statistical expertise required. The analytic techniques allow a user to easily find out not only what has happened, but also why it happened and what is likely to happen next.

The collection of algorithms included in the suite comprises those used most widely in business today, which can serve the widest variety of analytic use cases.

### Pivot Table

Cross Tabulation, also called a Crosstab or Pivot Table, is an analytic process that summarizes data through pivots of different fields from the data source. The discrete values in each field are used as labels for either rows or columns.

This kind of analysis shows in a simple way the relationship between two or more fields. These calculations have a high computational complexity and are most valuable when applied to millions of rows of data, but OpenText Analytics can handle them and calculate the crosstab in moments.
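As a rough illustration of what a crosstab computes, the sketch below counts records for each pair of values taken from two fields. It is plain Python, not the OpenText Analytics implementation; the `crosstab` helper, the field names and the sample records are all hypothetical.

```python
from collections import Counter


def crosstab(records, row_field, col_field):
    """Count records for every (row value, column value) pair."""
    counts = Counter((r[row_field], r[col_field]) for r in records)
    row_labels = sorted({row for row, _ in counts})
    col_labels = sorted({col for _, col in counts})
    # Pairs that never occur together get an explicit zero.
    return {
        row: {col: counts.get((row, col), 0) for col in col_labels}
        for row in row_labels
    }


# Hypothetical sample data: the discrete values become row and column labels.
customers = [
    {"gender": "F", "region": "North"},
    {"gender": "M", "region": "North"},
    {"gender": "F", "region": "South"},
    {"gender": "F", "region": "North"},
]

table = crosstab(customers, "gender", "region")
for row_label, cells in table.items():
    print(row_label, cells)
```

The cost grows with the number of records and the number of distinct value pairs, which is why large data sets make this calculation expensive.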

### Venn Diagram

A Venn Diagram provides an analysis of data identifying logical relations between groups of segments. It shows coincidences and differences.

Venn Diagrams are visual representations that help discover hidden relationships between data sets. They’re commonly used in cross-selling, loyalty and churn / attrition analysis.

Using the advanced analytics capabilities of OpenText Analytics, you can display up to five different segments. Results for more than five segments appear in table format only, because the visual representation would be too confusing.
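The coincidences and differences a Venn Diagram shows can be sketched with ordinary set operations. The example below assumes each segment is a set of customer IDs and counts how many IDs fall into each exclusive region of the diagram. The `venn_regions` helper and the segment names are hypothetical and are not part of OpenText Analytics.

```python
from itertools import combinations


def venn_regions(segments):
    """Count members of each exclusive region.

    `segments` maps a segment name to a set of IDs. A region is identified by
    the tuple of segment names its members belong to (and to no others).
    """
    names = sorted(segments)
    regions = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(segments[n] for n in combo))
            outside = set().union(*(segments[n] for n in names if n not in combo))
            regions[combo] = len(inside - outside)
    return regions


# Hypothetical segments of customer IDs.
segments = {
    "newsletter": {1, 2, 3, 4},
    "buyers": {3, 4, 5},
    "churned": {4, 6},
}

for region, count in venn_regions(segments).items():
    print(" & ".join(region), count)
```

With n segments there are 2^n - 1 regions, so five segments already produce 31 regions to display.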

### Bubble Charts

Bubble Charts display three dimensions of data. They show the distribution of categorical data across two axes of numeric variables and then use a third variable to set the size of each bubble. These charts are useful for displaying big data in a simple graphical way.

Instead of showing measures by attributes or categorical variables, as in the case of crosstabs, a bubble chart shows groups or categories according to numeric variables.

A Bubble Chart could be used to examine customers by age, order amounts, and average salaries in a retail-related business.

### Evolution

Evolution analysis shows a progression of data over time by examining the behavior of certain measures in different periodic scenarios. For example, examine how the sales of some product families evolve over a period of months. The product family field is the categorical variable under study, while the different scenarios are determined by the different values in the month of purchase variable.

### Profile/Z-Score

A Profile Analysis groups values and determines their relatedness to another group, called the profile segment. This analysis helps you draw a profile of a group of values from attributes selected in the stored data.

By default, the profile compares each selected attribute of the segment against all values in the database, including the analysis segment itself. In other words, the analysis shows how significant each attribute is in defining the segment, compared with all other values in the database.

Alternatively, if a base segment is selected in addition to the profile segment, the comparison is restricted to the values in that base segment. In the first case, the analysis characterizes the segment. In the second case, it determines whether the attributes are suitable for showing differences between the two groups (profile and base segment).

Several indicators measure how significant the attributes are in defining the analysis segment, including the Z-Score. The Z-Score determines whether the difference between two proportions is statistically significant. In a Profile analysis, the comparison is made between the group being analyzed and the group considered Rest: values that belong to the attribute whose significance you want to measure but not to the analysis segment.

Profile/Z-Score: The Z-Score determines whether the difference between two proportions is statistically significant
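The text above does not give the exact formula OpenText Analytics uses, so the following is only a sketch of the standard pooled two-proportion z-test, which matches the description of comparing a proportion in the analysis segment with the proportion in the Rest group. The function name and the sample counts are hypothetical.

```python
import math


def two_proportion_z(hits_a, size_a, hits_b, size_b):
    """Pooled z-score for the difference between two proportions.

    hits_a / size_a: share of group A (for example, the analysis segment)
    that has the attribute value; hits_b / size_b: the same for group B
    (for example, the Rest group).
    """
    p_a = hits_a / size_a
    p_b = hits_b / size_b
    pooled = (hits_a + hits_b) / (size_a + size_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / size_a + 1 / size_b))
    return (p_a - p_b) / std_error


# Hypothetical counts: 120 of 400 segment members have the attribute value,
# versus 300 of 2,000 records in the Rest group.
z = two_proportion_z(120, 400, 300, 2000)
print(round(z, 2))
```

Under the usual normal approximation, an absolute z-score above roughly 1.96 indicates a difference that is significant at the 5% level.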

### Map Charts

Map Charts visualize data on a geographical map. Analysis results appear as a choropleth map (thematic map) on which each predefined region is assigned a color or shade. Each shade corresponds to the magnitude of data values for that region. For example, a map analysis assigns each region in a country map a different shade of red that corresponds to the number of unemployed customers in each region.

Using a color gradient scheme when mapping quantitative data lets business users understand the findings immediately.

Map Charts: Quantitative data mapped with OpenText Analytics uses a specific color gradient to show data variation properly

### Pareto

Pareto analysis represents Pareto’s 80-20 principle with available data. Pareto’s principle states that:

- The minority group of the population (approximately 20%) bears 80% of something
- The remaining majority group of the population (approximately 80%) bears 20% of something

For example:

- 20% of clients are responsible for 80% of turnover
- 80% of turnover comes from 20% of the product catalog

Our built-in algorithm applies this principle and visualizes the relationship between a numerical and a categorical variable in your data set simply by dropping data fields onto the chart editor.

Pareto: Exploring the relationship between a numerical and a categorical variable in your data
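A Pareto analysis boils down to sorting categories by their numeric total and tracking the cumulative share. The sketch below assumes that reading; it is not the built-in algorithm, and the `pareto_table` helper and the turnover figures are hypothetical.

```python
def pareto_table(totals):
    """Sort categories by value (largest first) and add the cumulative share."""
    grand_total = sum(totals.values())
    running = 0.0
    rows = []
    for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        running += value
        rows.append((name, value, running / grand_total))
    return rows


# Hypothetical turnover per client (categorical variable: client,
# numerical variable: turnover).
turnover_by_client = {"A": 500, "B": 300, "C": 100, "D": 60, "E": 40}

for name, value, cumulative_share in pareto_table(turnover_by_client):
    print(f"{name}: {value:>4}  cumulative {cumulative_share:.0%}")
```

Reading the cumulative column from the top shows how few categories are needed to reach a given share of the total.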

### Clustering/K-Means

Clustering organizes data based on specified variables. The Clustering Algorithm produces segments of data that help identify groups with the largest number of attributes in common. In other words, clustering provides an idea about the similarity and differences between records in the same group.

There are several types of algorithms for building clusters from a data set: hierarchical, centroid, distribution, density and more recent approaches. The pre-built algorithm in OpenText Analytics is centroid-based clustering, also known as k-means clustering. It’s one of the most popular clustering methods in Data Mining.

This algorithm uses continuous variables because clustering calculates the distance between values to set up a group, and only fields with continuous values work for clustering. Continuous here means that the field takes many distinct values. Categorical variables, or fields with few discrete values like gender or occupation, don’t work. Also, k-means clustering needs to know in advance the number of clusters to build.

Clustering is extremely useful for making an advanced segmentation on specific or professional groups. For example, customers grouped in the same category or cluster may have common demographic features, or marketing campaigns launched over distinct channels may share similar results and costs.

Clustering/K-Means: Making an advanced segmentation on specific or professional groups
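To show why k-means needs continuous values and a cluster count up front, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. It is an illustration under simple assumptions (Euclidean distance, random initial centroids), not the OpenText Analytics implementation; the function and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Group points (tuples of numbers) into k clusters by nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment


# Hypothetical customers described by two continuous fields.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Both steps rely on distances and means, which is why fields with only a few categorical values are not suitable inputs.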

### Forecasting/Holt Winters

Forecasting is a method of extrapolating or predicting data based on time series. The forecasting in OpenText Analytics uses the Holt-Winters method, a fast, easy-to-use technique. This algorithm is used by many companies to produce short-term demand forecasts when their sales data contain a trend and a seasonal pattern.

This algorithm iteratively applies a formula to produce a time series and a forecast. The formula uses a weighted average of data prior to time 't' to provide a result for time 't'. The method has three components: level, trend and seasonality.

It adapts easily to changes in trends and seasonal patterns. For example, it can predict the monthly volume of sales or anticipate the number of orders over the coming months.
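The three components can be sketched as follows. This example assumes the additive form of Holt-Winters with fixed smoothing weights and a simple way of picking starting values; the text does not say which variant or parameter selection OpenText Analytics uses, so the function, its parameters (`alpha`, `beta`, `gamma`) and the sample series are illustrative assumptions only.

```python
def holt_winters_additive(series, season_length, alpha, beta, gamma, horizon):
    """Additive Holt-Winters: return `horizon` forecasts past the end of `series`.

    Assumes len(series) >= 2 * season_length so starting values can be estimated.
    alpha, beta and gamma are the smoothing weights (between 0 and 1) for the
    level, trend and seasonal components.
    """
    m = season_length
    first, second = series[:m], series[m:2 * m]
    level = sum(first) / m                         # starting level
    trend = (sum(second) - sum(first)) / (m * m)   # average change per step
    seasonals = [value - level for value in first]

    for t, value in enumerate(series):
        season = seasonals[t % m]
        previous_level = level
        # Each component is a weighted average of new evidence and its old value.
        level = alpha * (value - season) + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
        seasonals[t % m] = gamma * (value - level) + (1 - gamma) * season

    n = len(series)
    return [
        level + h * trend + seasonals[(n + h - 1) % m]
        for h in range(1, horizon + 1)
    ]


# Hypothetical quarterly order counts with an upward trend and a yearly pattern.
quarterly_orders = [30, 50, 40, 60, 34, 54, 44, 64, 38, 58, 48, 68]
forecast = holt_winters_additive(quarterly_orders, 4, 0.5, 0.3, 0.3, horizon=4)
print([round(value, 1) for value in forecast])
```

In practice the smoothing weights are usually tuned to minimize the error on past data rather than fixed by hand.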

### Decision Tree/C4.5

Decision Tree is a decision tool that uses a tree structure, defining distinct paths (branches) from nodes (decision nodes) and the final consequences of following a certain path. With this analysis it’s possible to classify objects using their attributes. This classification can then be used as a prediction about that object. Decision Trees are really simple to understand and to interpret.

In OpenText Analytics we have implemented the C4.5 algorithm to build Decision Trees. This algorithm must be trained on sample data. The Decision Tree learns with each successive application of the predictive model and becomes more accurate.
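C4.5 is generally described as choosing the attribute for each split by its gain ratio, that is, information gain divided by the split information. The sketch below computes that measure for categorical attributes only. It illustrates the criterion rather than the product's implementation, and the helper names and sample rows are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, normalized by split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    total = len(rows)
    # Weighted entropy of the target after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy([row[target] for row in rows]) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical training rows: which attribute best separates churners?
rows = [
    {"contract": "monthly", "region": "N", "churn": "yes"},
    {"contract": "monthly", "region": "S", "churn": "yes"},
    {"contract": "yearly", "region": "N", "churn": "no"},
    {"contract": "yearly", "region": "S", "churn": "no"},
]

for attribute in ("contract", "region"):
    print(attribute, gain_ratio(rows, attribute, "churn"))
```

The attribute with the highest gain ratio becomes the decision node, and the procedure repeats on each branch.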

Some patterns found by Data Mining algorithms, however, are invalid. Data Mining algorithms often find patterns in the training set that are not present in the general data set. This is called overfitting. To detect this problem, you can use the advanced capabilities of the OpenText Analytics suite to test the predictive model on a set of data that differs from the training set.

The learned patterns are applied to the test set and the resulting output is compared to the desired output. For example, a data mining algorithm that distinguishes spam from legitimate emails is trained on a set of sample emails. Once trained, the learned patterns are applied to a test set of emails. The accuracy of the predictive model is measured from how many emails it classifies correctly.
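The train-and-test procedure described above can be sketched in a few lines. The example assumes a simple random holdout split and uses a hand-written rule as a stand-in for a trained model; the helper names, the `has_link` field and the sample emails are all hypothetical.

```python
import random


def holdout_split(rows, test_size, seed=0):
    """Shuffle the rows and set aside `test_size` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


def accuracy(predict, test_rows, target):
    """Share of test rows whose predicted label matches the known label."""
    correct = sum(1 for row in test_rows if predict(row) == row[target])
    return correct / len(test_rows)


# Hypothetical labeled emails.
emails = [
    {"has_link": i % 2 == 0, "label": "spam" if i % 2 == 0 else "legitimate"}
    for i in range(10)
]


def link_rule(row):
    """Stand-in for a trained model; a real one would be learned from train_rows."""
    return "spam" if row["has_link"] else "legitimate"


train_rows, test_rows = holdout_split(emails, test_size=3)
print(accuracy(link_rule, test_rows, "label"))
```

A large gap between accuracy on the training set and accuracy on the test set is the usual sign of overfitting.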

Decision Tree is a widely used Data Mining technique. It helps to discover the best products to recommend to customers, which customers are more likely to churn, and which leads are qualified and which aren’t.

### Association Rules

Association rule mining is a widely known and used technique that detects relationship patterns or affinities in very large amounts of data. An association rule usually takes the form 'If a shopper purchases Item A and Item B, the shopper also purchases Item C.'

Initially, it was mainly used in market basket analysis, discovering patterns in purchase transactions recorded by point-of-sale systems in supermarkets. The results of the analysis are used as the basis for decisions about marketing activities, such as promotional pricing and product placements.

OpenText Analytics uses the FP-growth algorithm to build its association rules. FP stands for frequent pattern, and FP-growth is an efficient and highly scalable method for data mining, outperforming other popular methods like the Apriori Algorithm or Tree Projection. By applying our built-in FP-growth algorithm, you can quickly create an applicable set of rules over a large database. Additionally, from all the results obtained, only the most effective rules are automatically selected, in order to provide key business information for the business analyst.
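The text does not say how rule effectiveness is measured, so the sketch below uses the three measures most commonly quoted for association rules: support, confidence and lift. It scans the transactions directly instead of building an FP-tree, so it illustrates what a rule's scores mean rather than how FP-growth finds rules. The helper and the sample baskets are hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_consequent = sum(1 for t in transactions if consequent <= t)

    support = both / n  # how often the whole rule occurs
    confidence = both / with_antecedent if with_antecedent else 0.0
    # Lift above 1 means the consequent is more likely given the antecedent.
    lift = confidence / (with_consequent / n) if with_consequent else 0.0
    return support, confidence, lift


# Hypothetical point-of-sale baskets.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"milk", "jam"},
    {"bread", "milk"},
]

support, confidence, lift = rule_metrics(baskets, {"bread", "butter"}, {"jam"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Filtering rules by minimum thresholds on measures like these is one common way to keep only the most useful ones.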

### Logistic Regressions

A Logistic Regression is a probabilistic, predictive model that provides the probability of occurrence of a binary response (dependent variable) based on the values of one or more predictor variables (independent variables). The model is built using a logistic function whose coefficients are calculated from the observed occurrences of the binary response together with the predictor variables.

To start this analysis, you only need a Domain (a filter or a full table) that contains the Dependent variable (the one you want to predict with the logistic function) and the Independent variables, also known as predictors or explanatory variables.
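Once the coefficients are known, the logistic function turns a weighted sum of the predictors into a probability. The sketch below shows only that scoring step. The coefficient values are invented for illustration and the fitting step is omitted; nothing here reflects the product's internals.

```python
import math


def predict_probability(intercept, coefficients, predictors):
    """Probability of the positive outcome under a logistic model."""
    z = intercept + sum(c * x for c, x in zip(coefficients, predictors))
    # Two algebraically equal forms, chosen so exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical fitted model: probability that a customer churns,
# given months as a customer and number of support tickets.
intercept = -1.0
coefficients = [-0.05, 0.6]

print(predict_probability(intercept, coefficients, [24, 1]))
print(predict_probability(intercept, coefficients, [2, 5]))
```

The output always lies between 0 and 1, which is what makes the model suitable for a binary dependent variable.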

### Linear Regressions

A Linear Regression is a simple predictive model that provides the "next value" of a dependent variable from a set of predictors, or explanatory variables. The model is built using a linear function of the predictor values.

It has many practical applications, which can be divided into two types:

- Forecasting a dependent variable.
- Quantifying the level of relationship between the dependent variable and the predictors.
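For a single predictor, both uses can be sketched with ordinary least squares: the fitted line gives the forecast, and the slope quantifies the relationship. The text does not state which fitting method the product uses, so treat this as an assumption; the function and the sample figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (predictor) and units sold (dependent).
spend = [1, 2, 3, 4]
units = [3, 5, 7, 9]

slope, intercept = fit_line(spend, units)
next_value = slope * 5 + intercept  # forecast for a spend of 5
print(slope, intercept, next_value)
```

With several predictors the same idea applies, but the coefficients are solved jointly rather than one at a time.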

### Correlations

A Correlation measures the dependence relationship between two or more continuous sets of data. For example, there is a direct linear correlation between ice cream sales and the temperature on a given day. Once a segment is provided, OpenText Analytics calculates the correlation coefficients, using Pearson's correlation coefficient, for all possible pairs of data sets.

A Correlation Matrix tabulates the correlation coefficients between the pairs of variables provided. This matrix is an upper triangular matrix, where each cell shows the correlation coefficient for a certain combination of column and row. Each row and column represents one of the continuous sets of data compared.
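As a small illustration, the sketch below computes Pearson's coefficient for every pair of columns, which corresponds to the cells above the diagonal of the matrix. The helper names and the sample columns are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def correlation_pairs(columns):
    """Coefficient for each unordered pair of columns (the upper triangle)."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical daily observations.
columns = {
    "temperature": [20, 25, 30, 35],
    "ice_cream_sales": [200, 250, 300, 350],
    "umbrella_sales": [40, 30, 20, 10],
}

for pair, coefficient in correlation_pairs(columns).items():
    print(pair, round(coefficient, 3))
```

Because the coefficient is symmetric, only one cell per pair of columns needs to be computed and shown.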

### Naive Bayes

The main advantage of the Naive Bayes classifier, a linear classification tool, is that it requires relatively little training data.

This algorithm has an underlying probability model that assumes statistically independent features, and its parameters are estimated by maximum likelihood. It provides simple probabilistic classification with highly scalable classifiers that need only a small number of predictors in training to produce significant results.

Training is very fast because it is done by evaluating a closed-form expression (taking linear time) rather than by the iterative approximation used for most other types of classifiers. Because the variables are assumed independent, only the variance of each variable for every class is needed, without having to calculate a covariance matrix. Using the Naive Bayes tool is simple and intuitive, and much the same as using the Linear and Logistic Regression tools.
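The closed-form training described above can be sketched with a Gaussian Naive Bayes model, which is assumed here because the text mentions per-class variances. Training is a single pass that records a prior, a mean and a variance per class and variable. This is an illustrative sketch, not the product's code, and the function names and sample rows are hypothetical.

```python
import math
from collections import defaultdict


def train_gaussian_nb(rows, labels):
    """Closed-form training: prior, per-variable mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # Maximum-likelihood variance (divide by n); the tiny floor avoids
        # dividing by zero when a variable is constant within a class.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model


def classify(model, row):
    """Pick the class with the highest log posterior, treating variables as independent."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for x, mean, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Hypothetical customers: (age, yearly income) with a spending label.
rows = [(25, 30000), (30, 32000), (28, 31000), (50, 90000), (55, 95000), (52, 88000)]
labels = ["low", "low", "low", "high", "high", "high"]

model = train_gaussian_nb(rows, labels)
print(classify(model, (27, 30500)), classify(model, (53, 91000)))
```

No covariance matrix appears anywhere: each variable contributes its own term to the score, which is what the independence hypothesis buys.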