CSCI 241 Labs: Lab 8

An Array of Problems

There are 5 checkpoints, including the clean-up checkpoint, in this lab. You and your partner should work together using just one of your accounts. CHANGE WHO IS CONTROLLING THE COMPUTER AFTER EACH CHECKPOINT! If you need help with any exercise, raise your hand.

Copy the lab materials to your account from /home/student/Classes/Cs241/Labs/Lab08

In this lab, you and your partner will write code to implement the linear regression calculations discussed in the prelab.

Loading the Data

Start running BlueJ and open project Lab08. Edit the Linear class. Linear's main() method is designed to hold one set of data for a linear regression. This includes its x-values, y-values and associated calculations. By the end of this lab, your Linear class will not only be able to calculate the values for linear regression, but also graph the associated function.

When we create a Java class, we need to make two kinds of decisions:

Implementation Decisions: Every time you write code, you make implementation decisions. For example, we give you the signature and return type (remember what that is?) of a method, and you write the code to fill in the body of that method. The combination of the signature and return type is called the method prototype.
Design Decisions: These are the types of decisions made by a more seasoned programmer. The types of decisions include: what class(es) do I need to accomplish the goal of implementing a system? How should I break up the tasks into separate methods? What variables and constants do I need?

Deciding on which classes to develop is an advanced topic that you usually see in upper-level computer science courses. We do know enough at this time to choose variables and methods.

We start by deciding which values to save as variables in the main() method. The main() method is located at the bottom of the class.

The prelab contains the needed formulas. We could decide to declare a variable for each of the variables in the formulas. Here is the first formula:

y = mx + b

It contains these variables:

y
m (the slope of the line)
x
b (the y-intercept of the line)

Just inside the main() method, you will see declarations for two arrays of doubles (one to hold the x-values and one to hold the y-values): xArray[] and yArray[].

Decisions:

When do we keep a value in a variable? If a value is calculated and used more than once, we keep it in a variable so we don't have to recalculate it.
When do we rely on another static method to calculate a value for us? One good time to do this is when we have a complex formula, like that for the slope.

The prelab contained this formula to calculate the slope (m):

m = (Sum(x_i*y_i) - n*x_avg*y_avg) / (Sum(x_i²) - n*x_avg²),
where n is the number of points in the data set.
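As a rough illustration of how this formula turns into code, the sketch below computes the slope for a tiny made-up data set. The class name SlopeSketch, the method name slope and the sample values are assumptions for illustration only. It is not the code provided in the Linear class, and it computes every sum inline rather than calling helper methods.

```java
class SlopeSketch {
    // Slope from the formula above:
    // (Sum(x_i*y_i) - n*x_avg*y_avg) / (Sum(x_i^2) - n*x_avg^2)
    // Assumes both arrays have the same length and at least two distinct x-values.
    public static double slope(double[] xs, double[] ys) {
        int n = xs.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += xs[i];
            sumY += ys[i];
            sumXY += xs[i] * ys[i];
            sumXX += xs[i] * xs[i];
        }
        double xAvg = sumX / n;
        double yAvg = sumY / n;
        return (sumXY - n * xAvg * yAvg) / (sumXX - n * xAvg * xAvg);
    }

    public static void main(String[] args) {
        // Made-up points that lie exactly on y = 2x
        double[] xs = {1, 2, 3, 4};
        double[] ys = {2, 4, 6, 8};
        System.out.println(slope(xs, ys)); // prints 2.0
    }
}
```

For these points the numerator is 60 - 4*2.5*5 = 10 and the denominator is 30 - 4*2.5² = 5, so the slope is 2.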

We'll save the slope (m) in a variable so we can draw a graph of the function later.

To help figure out this formula, you'll write a method that sums the products of each x and y pair:

Sum(x_i*y_i)

This value doesn't need to be saved in a variable because we can call that method again at any time to recalculate it. However, keeping it in a variable saves processing time, because the same sum is also used to calculate r² (the correlation squared). Check the formula in the prelab and see where it is used.

Here are your tasks for the first checkpoint:

Add 3 new variables to the main() method to hold calculated values for:
- the slope (m), named m
- the intercept (b), named b
- the correlation squared (r²), named r2
Initialize each one to zero (0). We'll add the calculations later. Your code will read data from a file that contains real numbers. Note that there is already a variable declared named n. It will hold the number of points read in from the data file.

Just past the declarations and initializations of variables in the main() method you will see some code to read the data points from a file. It is currently incomplete. Your instructors have included a bit of "magical" code from Chapter 12 of the text. It is the try ... catch block. This code catches some of the errors likely to occur when trying to read from a file, including errors like "File not found". This error can occur when a file name is mistyped.
Our data input file contains several lines. The first line contains an integer that tells us how many lines of x and y values follow.

Here is an example of how we can read from a file that is set up in this way. size has already been declared as an int, and myArray has been declared as an array of ints.
```
Scanner file = new Scanner(new File(prefix + fileName));
size = file.nextInt();
myArray = new int [size];
// while the file is not empty AND the array still has room
for (int i = 0; file.hasNext() && i < myArray.length; i++)
{
    myArray[i] = file.nextInt();
}
```
In this example, the data part of the file contained one integer per line. Your file is a little different. Go to a terminal window, and cd into your Lab08 directory. Type
```
more lab08.dat
```
To read from this data file, change the prefix variable to contain the path to your Lab08 directory. (Remember, you can type pwd at the command line to see your current directory location.)
Complete the code between the try braces for this checkpoint. The comments tell you what to include. Because there are two numbers per line, you will need to call the Scanner's nextDouble() method twice to get both numbers.
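The sketch below shows one possible shape for reading a count followed by two real numbers per line. To stay self-contained it reads from a String instead of a file, so no try ... catch block is needed here. The class name PairReaderSketch, the method name readPairs and the sample text are assumptions for illustration, and the sample text is not the contents of lab08.dat. Returning both arrays in a two-row array is only a convenience for this sketch; your main() method keeps them in xArray and yArray.

```java
import java.util.Locale;
import java.util.Scanner;

class PairReaderSketch {
    // Parses text laid out as: a count on the first line, then one "x y" pair per line.
    // Returns a two-row array: row 0 holds the x-values, row 1 holds the y-values.
    public static double[][] readPairs(String text) {
        // Locale.US makes the Scanner accept "1.5" regardless of the machine's settings
        Scanner in = new Scanner(text).useLocale(Locale.US);
        int n = in.nextInt();
        // The arrays can only be created once the count is known
        double[] xs = new double[n];
        double[] ys = new double[n];
        // Stop when the data runs out OR the arrays are full
        for (int i = 0; in.hasNextDouble() && i < n; i++) {
            xs[i] = in.nextDouble(); // first number on the line
            ys[i] = in.nextDouble(); // second number on the line
        }
        in.close();
        return new double[][] { xs, ys };
    }

    public static void main(String[] args) {
        String sample = "3\n1.0 2.5\n2.0 4.5\n3.0 6.5\n";
        double[][] data = readPairs(sample);
        for (int i = 0; i < data[0].length; i++) {
            System.out.println(data[0][i] + " " + data[1][i]);
        }
    }
}
```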
The main() method should also print the values held in each array after reading the data from the file. There is a method named printArrayContents at the beginning of the class. Complete this method, using a loop (or two) to print the data in the array that comes from the parameter. Once you have finished all these parts for the checkpoint, test your program by uncommenting the lines in the main() method which call it.

1 Show us your declared variables and the code you wrote to read the data from the file. Run your main() method so we can see the results. Be ready to answer:

Why are the arrays for the x's and y's instantiated when we start reading data from the file, rather than at time of declaration?

Methods to Calculate the Sums

We will need 3 kinds of sums to do our calculations: sum of entries in an array, sum of the squares of the values in an array and sum of products of x and y values. Go back to the prelab and review the formulas for m, b and r². The Linear class will contain 3 different methods to calculate these sums. Each of these methods will take either one or two arrays as parameters. Here is what each part means in the prelab formulas:

Sum(x_i) and Sum(y_i) hold the sum of all the x-values and y-values, respectively. Once we get the values from these sums, we can easily calculate x_avg and y_avg.
Sum(x_i²) holds the sum of the squares of the x-values. We also see Sum(y_i²), which holds the sum of the squares of the y-values. These are calculated using similar mechanisms, so the same method works for both.
Sum(x_i*y_i) holds the sum of all the x*y values.

Your next task is to write 3 different static methods to calculate and return these values. The method prototypes are:

public static double sum(double [] array),
public static double sumOfSquares(double [] array) and
public static double sumOfProducts(double [] array1, double [] array2).

Each method should return a double which holds the result of the sum. Because we want these methods to work for different arrays, make sure you send the correct array (or arrays) to the methods as arguments.

To test your new methods, run them directly by right-clicking the Linear class in the BlueJ window. When the Method Call window pops up, you will need to provide arguments. When you need an array as an argument, you can type its contents inside curly braces. For example, when running the sum method, you can type {1,2,3,4} in the box before clicking Ok.

Here are the answers you should expect from each of the methods by using the indicated arrays:

sum() (using {1,2,3,4}): 10.0
sumOfSquares() (using {1,2,3,4}): 30.0
sumOfProducts() (using {1,2,3,4} and {5,6,7,8}): 70.0
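The expected values above can be reproduced with a sketch along the following lines. The method prototypes are the ones given in this lab; the class name SumsSketch and the loop bodies are just one possible way to fill them in, and the sketch assumes that both arrays passed to sumOfProducts have the same length.

```java
class SumsSketch {
    // Adds up every entry in the array
    public static double sum(double[] array) {
        double total = 0;
        for (int i = 0; i < array.length; i++) {
            total += array[i];
        }
        return total;
    }

    // Adds up the square of every entry in the array
    public static double sumOfSquares(double[] array) {
        double total = 0;
        for (int i = 0; i < array.length; i++) {
            total += array[i] * array[i];
        }
        return total;
    }

    // Adds up array1[i] * array2[i] for every index i.
    // Assumption: both arrays have the same length.
    public static double sumOfProducts(double[] array1, double[] array2) {
        double total = 0;
        for (int i = 0; i < array1.length; i++) {
            total += array1[i] * array2[i];
        }
        return total;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = {5, 6, 7, 8};
        System.out.println(sum(a));              // prints 10.0
        System.out.println(sumOfSquares(a));     // prints 30.0
        System.out.println(sumOfProducts(a, b)); // prints 70.0
    }
}
```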

2 Show us your code and output for the 3 summing methods.

Time to Calculate!

It's now time to put these pieces together and do the full calculations. Looking at the original formulas, you can see that both the slope and intercept formulas need to use the average of the x values and the average of the y values. That tells us that calculating both averages inside the main() method would save some processing time.

Uncomment the lines that declare xAverage and yAverage, and finish their calculations by calling one of the methods you wrote for the last checkpoint. Is there more than one choice for a denominator?

Earlier in the class, look for a method named findM(). This method performs the calculation needed for the slope. Examine its parameters. Uncomment the line inside it, which can now run because you've written your summing methods. Now, add a call to this method from your main() method and save the value it returns in your own variable that holds the slope.

Using the findM() method as a model, write 2 more methods:
public static double findB(double avgX, double avgY, double slope) and
public static double findRSquared(double [] xValues, double [] yValues).
Each of these uses the corresponding formula in the prelab to calculate and return values for the intercept and correlation squared, respectively.
When you are ready to try them out, add calls in your main() method and save the values in the variables you declared in checkpoint 1.
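The prelab formulas are not reproduced in this handout, so the sketch below assumes the standard least-squares forms: b = y_avg - m*x_avg and r² = (Sum(x_i*y_i) - n*x_avg*y_avg)² / ((Sum(x_i²) - n*x_avg²) * (Sum(y_i²) - n*y_avg²)). If the prelab states them differently, follow the prelab. The prototypes match the ones above; the class name FitSketch and the sample data are hypothetical, and the sums are computed inline to keep the sketch self-contained.

```java
class FitSketch {
    // Intercept, ASSUMING the standard form b = y_avg - m * x_avg
    public static double findB(double avgX, double avgY, double slope) {
        return avgY - slope * avgX;
    }

    // Correlation squared, ASSUMING the standard form described in the lead-in.
    // Assumes both arrays have the same length and neither is constant.
    public static double findRSquared(double[] xValues, double[] yValues) {
        int n = xValues.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
        for (int i = 0; i < n; i++) {
            sumX += xValues[i];
            sumY += yValues[i];
            sumXY += xValues[i] * yValues[i];
            sumXX += xValues[i] * xValues[i];
            sumYY += yValues[i] * yValues[i];
        }
        double xAvg = sumX / n;
        double yAvg = sumY / n;
        double top = sumXY - n * xAvg * yAvg;
        double bottom = (sumXX - n * xAvg * xAvg) * (sumYY - n * yAvg * yAvg);
        return (top * top) / bottom;
    }

    public static void main(String[] args) {
        // Made-up points on y = 2x: averages are 2.5 and 5.0, slope is 2.0
        double[] xs = {1, 2, 3, 4};
        double[] ys = {2, 4, 6, 8};
        System.out.println("b = " + findB(2.5, 5.0, 2.0));
        System.out.println("r2 = " + findRSquared(xs, ys));
    }
}
```

Because the sample points fall exactly on a line through the origin, this sketch should report an intercept of 0 and an r² of 1.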

After making these method calls, print the values that you have calculated.
When you run main(), you should see these values:

m = 0.09496242563609694
b = -0.3514937389023274
r² = 0.9107185718154513

3 Show us your finished calculation methods and run main() so we can see your results.

Plotting the Curve

Back to reviewing the original equation:

y = mx + b

Since you have calculated values for m and b, we can use those values to plot the full set of points on a graph and draw the associated best-fit line through them.

For this checkpoint, you will finish the plot() method. When working with graphics, we draw images based on pixel coordinates. These have the origin (0,0) in the upper left and each pixel to the right or down increments x or y by 1. This doesn't work out very well for our graphing. We want the origin in the lower left and we want to stretch or shrink our x and y values to fit into the window. To do this we include methods to translate our x's and y's to pixel values.

public static int toPixelX(double x)
takes a double representing the equation's x-value as a parameter and returns the equivalent pixel position. The pixel position was calculated using this formula: pixelX = 10 + x/4.
public static int toPixelY(double y)
takes a double representing the equation's y-value as a parameter and returns the equivalent pixel position. The pixel position was calculated using this formula: pixelY = 450 - 2y.
toPixelX() and toPixelY() really depend on the data set. If the values in the arrays were different, these two methods would have to change. See Extras for Experts below, for example.
public void plot()
should draw the x- and y-axes, plot all points, and draw the best-fit line. This method is already stubbed in. Our class extends the ACM library class named GraphicsProgram. While inheritance is an advanced concept, what it means is that the graphics "stuff", e.g. methods and colors, is already written. We just have to use them.
The plot() method begins by calling this.start();, which opens the graphics window on the screen. The method also contains commented-out code for drawing the axes. Uncommenting those lines will draw the axes. Uncomment the line in the main() method that calls plot. Run the main() method to make certain the axes appear. You might have to expand the lower part of your window to see the x-axis at the bottom.
1. Plotting the points:
  The code in plot() contains special lines which draw GLines (lines) and GOvals (circles, in this case) in the plotting window. There is a partially complete for loop in the code which contains 3 lines in its body: the first 2 are currently commented out and incomplete. Those lines calculate the positions for the circles that will be plotted. The last line adds the circle to the plotting window.
  Complete the first 2 lines in the loop body that call toPixelX() and toPixelY() to determine where to place the circle. Also, make your loop go through the entire length of the arrays by completing the missing parts of the for control so that it does so.
2. Draw the best-fit line:
  The last line in plot() draws a line in the plotting window that best fits the points. You will need to calculate the values for the variables which are currently commented out.
  You want to draw a line between x = 0 and x = 2000. Calculate each corresponding y value from the formula y = mx + b. Translate each x and y to its equivalent pixel position, and the last line of code will draw the line.
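The endpoint calculation for the best-fit line can be sketched as follows. The two translation methods use the formulas stated earlier in this section (pixelX = 10 + x/4 and pixelY = 450 - 2y); rounding to the nearest pixel is an assumption, since the lab's own versions may convert to int differently. The class name PixelSketch, the helper lineEndpoints and the sample slope and intercept are hypothetical, and the sketch only computes coordinates without drawing anything.

```java
class PixelSketch {
    // pixelX = 10 + x/4, rounded to the nearest pixel (rounding is an assumption)
    public static int toPixelX(double x) {
        return (int) Math.round(10 + x / 4);
    }

    // pixelY = 450 - 2y, so larger y-values land higher up in the window
    public static int toPixelY(double y) {
        return (int) Math.round(450 - 2 * y);
    }

    // Pixel endpoints {x0, y0, x1, y1} of the line y = m*x + b between x = 0 and x = 2000
    public static int[] lineEndpoints(double m, double b) {
        double x0 = 0;
        double x1 = 2000;
        double y0 = m * x0 + b;
        double y1 = m * x1 + b;
        return new int[] { toPixelX(x0), toPixelY(y0), toPixelX(x1), toPixelY(y1) };
    }

    public static void main(String[] args) {
        // Made-up slope and intercept, not the values from this lab's data
        int[] p = lineEndpoints(0.1, 0.0);
        System.out.println("(" + p[0] + ", " + p[1] + ") to (" + p[2] + ", " + p[3] + ")");
    }
}
```

In plot(), the four values computed this way would be the ones handed to the line-drawing code that is already in place.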

4 Show us the code for the methods, and run the program so we can see the graph. It should look like the figure given in your prelab.

Don't forget to exit Firefox before you log out.

5 Show us that you have logged out, cleaned up, turned off your monitor and pushed in your chairs for this last checkpoint.