regression: a data analysis api
from scideas software

regression api definition

endpoint

https://services.scideas.net/regression/api

no specific headers required

method

POST

JSON in body of POST

POST fields

key

String <api key>

required

outcome_variable

String <outcome variable label in csv file>

required

ignore_variables

Array <variable names to ignore from data>

optional

convert_date_to

String <week or month>

optional

data

Array <csv file rows>

required, see notes below on preparation
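
The fields above can be assembled into a request along the following lines. This is a minimal Python sketch, not an official client: the helper names build_payload and post_regression are made up for illustration, and it assumes that each element of data is one line of the csv file as a string. The Content-Type header is optional, since no specific headers are required.

```python
import json
import urllib.request

ENDPOINT = "https://services.scideas.net/regression/api"


def build_payload(api_key, outcome_variable, rows,
                  ignore_variables=None, convert_date_to=None):
    """Assemble the JSON body from the POST fields (hypothetical helper)."""
    payload = {
        "key": api_key,
        "outcome_variable": outcome_variable,
        # Assumption: one csv line per array element, header line first.
        "data": list(rows),
    }
    if ignore_variables:
        payload["ignore_variables"] = list(ignore_variables)
    if convert_date_to is not None:
        if convert_date_to not in ("week", "month"):
            raise ValueError("convert_date_to must be 'week' or 'month'")
        payload["convert_date_to"] = convert_date_to
    return payload


def post_regression(payload, timeout=30):
    """POST the payload as JSON and return the decoded JSON response."""
    body = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        # Optional: the api does not require any specific headers.
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs a real api key, so it is left commented out):
# payload = build_payload("<api key>", "Sales", rows, ["Postal Code"], "month")
# result = post_regression(payload)
```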

Response fields

data_count

number of data rows supplied

outcome_variable

variable used as the outcome

tested_variables

variables used in analysis

number_observations

number of data rows used for analysis

number_tests

number of data rows used to test accuracy of analysis

dates_converted_to

value used for date conversion

empty if no date conversion has been made

calls

associative array of api calls used

prediction_mean_accuracy

mean predicted value as a percentage of the actual outcome value

a value below 100 means the model underestimates the outcome

standardized_coefficients

associative array of integers giving a measure of the relative contribution of each variable

negative means higher variable tends to make outcome lower

summary

array of text summaries, one for each tested variable

pdf

url for the generated pdf, showing summary and charts

pdf is available for download for 24 hours after generation
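
One way to hold the response fields in code is a small typed record. The sketch below is illustrative only: the class name RegressionResult is made up, the Python types are guesses based on the descriptions above, and the sample values are invented rather than real api output. The format of error responses is not covered here.

```python
from dataclasses import dataclass, fields


@dataclass
class RegressionResult:
    """Typed view of the response fields (types are assumptions)."""
    data_count: int
    outcome_variable: str
    tested_variables: list
    number_observations: int
    number_tests: int
    dates_converted_to: str
    calls: dict
    prediction_mean_accuracy: float
    standardized_coefficients: dict
    summary: list
    pdf: str

    @classmethod
    def from_response(cls, response):
        """Build a record from the decoded JSON, failing on missing fields."""
        names = [f.name for f in fields(cls)]
        missing = [name for name in names if name not in response]
        if missing:
            raise KeyError("missing response fields: " + ", ".join(missing))
        return cls(**{name: response[name] for name in names})

    @property
    def dates_were_converted(self):
        # dates_converted_to is empty if no date conversion has been made.
        return bool(self.dates_converted_to)


# Invented values, for illustration only.
sample = {
    "data_count": 100,
    "outcome_variable": "Sales",
    "tested_variables": ["State", "Quantity"],
    "number_observations": 80,
    "number_tests": 20,
    "dates_converted_to": "",
    "calls": {},
    "prediction_mean_accuracy": 97.5,
    "standardized_coefficients": {"State": -3, "Quantity": 20},
    "summary": ["...", "..."],
    "pdf": "https://example.invalid/report.pdf",
}
result = RegressionResult.from_response(sample)
```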

regression api data preparation

The data to be analyzed must first be collated into a csv file. The first line must contain the variable names. Include the outcome variable in the file. Here is an example csv file for sales data.

Once you have your csv file you need to encode the contents as a JSON string to use in the data POST field.
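
The encoding step might look like the following Python sketch. It assumes that the data field takes one csv line per array element, with the header line first; if the service expects the whole file as a single string, json.dumps(csv_text) would produce that instead. The function name csv_text_to_rows is made up for illustration.

```python
import json


def csv_text_to_rows(csv_text):
    """Split csv text into a list of non-empty lines, header line first."""
    # Note: this simple split would break quoted fields that contain newlines.
    rows = [line for line in csv_text.splitlines() if line.strip()]
    if not rows:
        raise ValueError("the csv text contains no rows")
    return rows


csv_text = "State,Sales\nKentucky,261.96\n\nCalifornia,14.62\n"
rows = csv_text_to_rows(csv_text)

# json.dumps takes care of quoting and escaping inside each row string.
encoded = json.dumps({"data": rows})
```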

There are some very important considerations to be made in choosing data. Don't just include all the data you have!

Do not include similar variables

Regression analysis does not work well if two or more similar variables are included. Similar here means correlated.

In the csv file above, the variables State and Postal Code would be highly correlated, so we don't want to include them both. By using the ignore_variables POST field in the request, we can tell the api to ignore the Postal Code variable.
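
A quick check before sending data can help spot such pairs. The Python sketch below is a heuristic of our own, not something the api does: it tests whether one column completely determines another, as Postal Code determines State. The function name determines and the sample rows are invented for illustration.

```python
import csv
import io


def determines(csv_text, source, target):
    """True if each value of `source` always appears with one `target` value."""
    seen = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # setdefault stores the first target seen for this source value.
        if seen.setdefault(row[source], row[target]) != row[target]:
            return False
    return True


# Invented rows, for illustration only.
csv_text = (
    "State,Postal Code,Sales\n"
    "Kentucky,42420,261.96\n"
    "California,90036,14.62\n"
    "California,94109,22.37\n"
    "Kentucky,42420,731.94\n"
)

ignore_variables = []
if determines(csv_text, "Postal Code", "State"):
    # Postal Code adds nothing that State does not already carry.
    ignore_variables.append("Postal Code")
```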

Dates

Regression analysis requires numeric data. The api will convert categoric variables, like State for example, into numeric variables. Every different value for the categoric variable will be converted to a different number.

This will be done in the order new values are found in the data, so in the above example Kentucky would be 1, California 2, etc. The order and number don't matter.
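
The numbering described above can be sketched in a few lines of Python. This only illustrates the idea of numbering values in order of first appearance; the api's actual implementation is not published here, and the function name is made up.

```python
def encode_by_first_appearance(values):
    """Give each distinct value the next integer, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1  # numbering starts at 1
        encoded.append(codes[value])
    return encoded, codes


states = ["Kentucky", "California", "Kentucky", "Florida"]
encoded, codes = encode_by_first_appearance(states)
# encoded is [1, 2, 1, 3]; codes maps Kentucky to 1, California to 2, Florida to 3
```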

If dates were converted like this, every date would become a different number, since every date is different, and this wouldn't be at all useful. You probably want to know how the month or the week of the year affects the outcome.

To help with this there is a convert_date_to POST field. When it is given the value week or month, the api converts a date variable to weeks (1 to 52) or months (1 to 12). It recognises either US style mm/dd/yyyy or standard dd-mm-yyyy (or dd.mm.yyyy), and it converts any field whose name includes the word "date".
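
To see what such a conversion involves, here is a Python sketch that accepts the three date styles mentioned. It is not the api's own code. In particular, the week rule used here (seven-day blocks counted from 1 January, capped at 52) is an assumption, since the exact rule is not stated.

```python
from datetime import datetime

# Slashes are read as US style (month first); dashes and dots as day first.
_FORMATS = ("%m/%d/%Y", "%d-%m-%Y", "%d.%m.%Y")


def parse_date(text):
    """Parse mm/dd/yyyy, dd-mm-yyyy or dd.mm.yyyy into a date."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised date: " + repr(text))


def convert_date(text, convert_to):
    """Return the month (1 to 12) or week (1 to 52) for a date string."""
    day = parse_date(text)
    if convert_to == "month":
        return day.month
    if convert_to == "week":
        day_of_year = day.timetuple().tm_yday
        # Assumed rule: seven-day blocks from 1 January, last days folded into 52.
        return min((day_of_year - 1) // 7 + 1, 52)
    raise ValueError("convert_to must be 'week' or 'month'")
```

Under these assumptions, convert_date("11/08/2016", "month") and convert_date("08-11-2016", "month") both give 11.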

Alternatively, prepare your own field of numeric months, weeks, hours or whatever, and give the variable a name such as Months that does not include the word "date".
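
Preparing your own field could look like the Python sketch below, which replaces a timestamp column with an hour-of-day column. The column name Order Date, the timestamp format and the function name are all assumptions made for this example. The new name is checked because a name such as Update also contains the word "date".

```python
import csv
import io
from datetime import datetime


def replace_with_hour(csv_text, source_column, new_column="Hour",
                      fmt="%Y-%m-%d %H:%M"):
    """Return csv text with `source_column` replaced by a numeric hour column."""
    if "date" in new_column.lower():
        raise ValueError("the new column name must not include the word 'date'")
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = [new_column if name == source_column else name
                  for name in reader.fieldnames]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        stamp = datetime.strptime(row.pop(source_column), fmt)
        row[new_column] = stamp.hour  # 0 to 23
        writer.writerow(row)
    return out.getvalue()


# Invented rows, for illustration only.
prepared = replace_with_hour(
    "Order Date,Sales\n2016-11-08 14:30,261.96\n2016-11-09 09:05,14.62\n",
    "Order Date",
)
```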

Code example

See here for a PHP code example.

regression api result interpretation

Standardized coefficients

This is an associative array of the coefficients determined by the regression analysis. These should only be understood as indicating the relative influence of each variable. For example, a value of 20 for variable A and a value of 1 for variable B means that variable A makes 20 times more difference to the outcome than variable B. It does not mean that doubling variable A will change the outcome by 40 times.

The sign of the coefficient indicates the direction of influence. For a negative coefficient, increasing the variable value will tend to lower the outcome value.
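
This reading of the coefficients can be expressed as a short Python sketch that ranks variables by the size of their coefficient and reports the direction of influence. The function names and the sample coefficients are invented for illustration; only the interpretation comes from the text above.

```python
def rank_by_influence(coefficients):
    """Sort variables by absolute coefficient, largest influence first."""
    def direction(value):
        if value < 0:
            return "lowers"   # a higher variable tends to lower the outcome
        if value > 0:
            return "raises"
        return "none"
    ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]),
                    reverse=True)
    return [(name, value, direction(value)) for name, value in ranked]


def relative_influence(coefficients, a, b):
    """How many times more difference variable a makes than variable b."""
    # Raises ZeroDivisionError if the coefficient of b is 0.
    return abs(coefficients[a]) / abs(coefficients[b])


# Invented coefficients, for illustration only.
coefficients = {"Discount": -12, "Quantity": 20, "Month": 1}
ranking = rank_by_influence(coefficients)
ratio = relative_influence({"A": 20, "B": 1}, "A", "B")  # 20.0
```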

Prediction mean accuracy

Most of the data rows are used in the analysis. Some are kept back and used to test the model created against the actual outcome values supplied. This gives an indication of how successful the analysis has been. The nearer the prediction_mean_accuracy value is to 100 the more the model can be trusted.
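
The idea of holding rows back and scoring predictions against them can be sketched as follows. Both functions are assumptions made for illustration: how the api chooses its test rows and the exact formula behind prediction_mean_accuracy are not stated. Here the score is taken to be the mean of each prediction expressed as a percentage of the actual value, which is consistent with a value below 100 meaning the model underestimates.

```python
def split_holdout(rows, test_fraction=0.2):
    """Keep the last rows back for testing (the api's own split is unknown)."""
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[:-n_test], rows[-n_test:]


def mean_accuracy(predicted, actual):
    """Mean of predictions as a percentage of the actual outcome values."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [100.0 * p / a for p, a in zip(predicted, actual)]
    return sum(ratios) / len(ratios)


train_rows, test_rows = split_holdout(list(range(10)))

# Invented numbers: both predictions fall short of the actual value of 100.
score = mean_accuracy([80, 90], [100, 100])  # 85.0, so an underestimate
```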

Summary

This is a text summary for each variable included in the analysis. It simply attributes an adjective to different ranges of coefficient value.

Where appropriate, the summary also includes an indication of the direction of influence: does a higher variable value make the outcome higher or lower? For categoric variables that have been encoded to numeric values, this indication can only be made for binary categories, for example sex (male or female) or perishable goods (yes or no).

PDF

The PDF may be downloaded using the url in the pdf field returned. The document includes the text summary and also a chart for each of the variables included in the analysis, plotted with the outcome. The charts are normalized so that the variable and outcome appear on the same scale. The charts can sometimes be useful to visualize the text summary and to check that the summary makes sense. For example, it is often easy to see that the variable tends to go up when the outcome goes down for a variable with a negative coefficient.
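
Since the pdf is only available for 24 hours, it may be worth saving a copy straight after the analysis. A minimal Python sketch using only the standard library follows; the function name download_pdf and the file name report.pdf are made up for illustration.

```python
import shutil
import urllib.request


def download_pdf(pdf_url, destination, timeout=30):
    """Stream the document at pdf_url into the file at destination."""
    with urllib.request.urlopen(pdf_url, timeout=timeout) as response, \
            open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination


# Usage, where result is the decoded JSON response from the api:
# download_pdf(result["pdf"], "report.pdf")
```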

OAS3

Download the openapi 3.0.0 definition file for regression

Test mode
