Network Access with GAWK


Network Access - How to get things running

The awk scripting language was originally developed as a pattern-matching language for writing short programs that perform data manipulation tasks. It was never meant to be used for networking purposes. awk's strength is the manipulation of textual data that is stored in files. If we want to exploit its features in a networking context, we have to use an access mode for network connections that resembles file access as closely as possible. GAWK therefore provides the following three special files (listed in descending order of importance). These files let us use the protocols of the same name for establishing connections.

  1. `/inet/tcp' is used for reliable connections. When accessing this special file, we initiate a connection that uses the TCP protocol; the protocol that the WWW is based on. Using TCP involves some overhead but novices as well as experts are well advised to use TCP because it takes away most of the burdens of network programming.
  2. `/inet/udp' is used for unreliable connections. When accessing this special file, we initiate a connection that uses the UDP protocol; the protocol that the traditional NFS file system is based on. UDP causes little overhead but it cannot guarantee reliability; therefore the application level protocol (for example NFS) has to take care of reliability.
  3. `/inet/raw' is used for low level data transmission. When accessing this special file, we are responsible for each and every bit that controls IP network traffic. We are usually not interested in doing so, but it is available for the sake of completeness. It can be interesting for those who do experiments with protocols on top of IP and have root privileges to do so.

In this chapter, we will only demonstrate how to use the TCP protocol. The other protocols are much less important for most users (UDP) or even intractable (RAW).

awk is also meant to be a prototyping language. It is used to demonstrate feasibility and to play with features and user interfaces. This can be done with the file-like handling of network connections described above. In exchange for this simple handling, we give up many of the advanced features of the TCP/IP family of protocols. Such features are available when programming in C or Perl. In fact, what we do in this chapter is very similar to what is described in books like Internet Programming with Python and Advanced Perl Programming or Web Client Programming with Perl. But we do it without first learning object oriented ideology, underlying languages like Tcl/Tk, Perl, Python and all the libraries necessary to extend these languages before they are ready for the Internet.

Establishing a TCP connection

Let us observe a network connection at work. Type in the following program and watch the output. Within a second it connects via TCP (`/inet/tcp') to the machine it is running on (`localhost') and asks the service `daytime' on the machine what time it is.

BEGIN {
  "/inet/tcp/0/localhost/daytime" |& getline
  print $0
}

Even experienced awk users will find the second line strange in two respects:

  1. A special file is used as a shell command which pipes its output into getline. One would rather expect to see the special file being read like any other file (getline < "/inet/tcp/0/localhost/daytime").
  2. The operator |& has not been part of any awk implementation up to now. It is actually the only extension of the awk language needed (apart from the special files) to introduce network access.

Arnold Robbins decided to introduce the |& operator in order to overcome the crucial restriction that access to files and pipes in awk is always uni-directional. It was formerly impossible to use both access modes on the same file or pipe. Instead of changing the whole concept of file access, he decided to introduce the |& operator which behaves exactly like the usual pipe operator except for two additions:

  1. Usual shell commands connected to their GAWK program with a |& pipe can be accessed bi-directionally. The |& turned out to be a quite general, useful and natural extension of awk.
  2. Pipes that consist of a special file name for network connections are not executed as shell commands. Instead, they can be read and written to just like a full duplex network connection.

What happens in this program? The operator |& tells getline to read a line from the special file `/inet/tcp/0/localhost/daytime'. We could also have printed a line into the special file. Instead we just read a line with the time, printed it and closed the connection by finishing the program.
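The bi-directional behaviour of |& can also be tried without any network at all. The following is a minimal sketch, assuming a GAWK with |& support and a `sort' command on the search path; the variable names Sorter, Line and Result are chosen only for this illustration. It writes three lines into the pipe, closes the writing end so that sort sees the end of its input, and then reads the sorted lines back from the same pipe.

```awk
BEGIN {
    Sorter = "sort"                      # an ordinary shell command, not a special file
    print "cherry" |& Sorter             # write into the two-way pipe
    print "apple"  |& Sorter
    print "banana" |& Sorter
    close(Sorter, "to")                  # close only the writing end (GAWK extension)
    while ((Sorter |& getline Line) > 0) # read the results back from the same pipe
        Result = Result (Result == "" ? "" : " ") Line
    close(Sorter)
    print Result                         # prints: apple banana cherry
}
```

The network examples in this chapter use the same operator, only with a special file name in place of the command.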

It may well be that the daytime program above does not run on your machine. Looking at the possible reasons will teach you much about the typical problems that arise in network programming. First of all, your implementation of GAWK may not support network access because it is a pre 3.10 version, or your machine may not have a network interface. Perhaps your machine uses some other protocol like DECnet or Novell's IPX. For the rest of this chapter we will assume you work on a Unix machine that supports TCP/IP. If the program does not run on such a machine, it may help to replace the name `localhost' with the name of your machine or its IP address. If it does run, you could replace `localhost' with the name of another machine in your vicinity, so that the program connects to that machine instead. You should then see the date and time being printed by the program. Otherwise your machine may not support the `daytime' service. Try changing the service to `chargen' or `ftp'. This way, the program connects to other services that should give you some response. If you are curious, you should have a look at your file `/etc/services'. It could look like this:

Service name  Service number  Comment
echo          7/tcp           echo sends back each line it receives
echo          7/udp           echo is good for testing purposes
discard       9/tcp           discard behaves like `/dev/null'
discard       9/udp           discard just throws away each line
daytime       13/tcp          daytime sends date & time once per connection
daytime       13/udp
chargen       19/tcp          chargen infinitely produces character sets
chargen       19/udp          chargen is good for testing purposes
ftp           21/tcp          ftp is the usual file transfer protocol
telnet        23/tcp          telnet is the usual login facility
smtp          25/tcp          smtp is the Simple Mail Transfer Protocol
finger        79/tcp          finger tells you who is logged in
www           80/tcp          www is the HyperText Transfer Protocol
pop2          109/tcp         pop2 is an older version of pop3
pop2          109/udp
pop3          110/tcp         pop3 is the Post Office Protocol
pop3          110/udp         pop3 is used for receiving email
nntp          119/tcp         nntp is the USENET News Transfer Protocol
irc           194/tcp         irc is the Internet Relay Chat
irc           194/udp

Here, you find a list of services that traditional Unix machines usually support. If your GNU/Linux machine does not do so, it may be that these services are switched off in some startup script. Systems running some flavour of Microsoft Windows usually do not support such services. Nevertheless, it is possible to do networking with GAWK on Microsoft Windows. The first column of the file holds the name of the service, the second a unique number and the protocol that one can use to connect to this service. You see that some services (`echo') support TCP as well as UDP.
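To make the two-column layout concrete, here is a small sketch that turns one line in this format into a lookup entry. The function name AddService and the array Ports are made up for this illustration, and the sketch assumes that anything after a `#' is a comment, as is usual in `/etc/services'.

```awk
# Parse one line of the form "name number/protocol" and store the
# port number in Ports["name/protocol"]. Returns 1 on success, 0 otherwise.
function AddService(line, Ports,    f, n, pp) {
    sub(/#.*/, "", line)                 # drop a trailing comment, if any
    n = split(line, f, " ")              # " " splits on runs of blanks and tabs
    if (n < 2) return 0                  # empty or incomplete line
    if (split(f[2], pp, "/") != 2) return 0
    Ports[f[1] "/" pp[2]] = pp[1] + 0    # e.g. Ports["echo/tcp"] = 7
    return 1
}

BEGIN {
    AddService("echo     7/tcp", Ports)
    AddService("daytime  13/udp   # date & time", Ports)
    print Ports["daytime/udp"]           # prints: 13
}
```

To load the real file, one could call AddService($0, Ports) for every input line while running the script over `/etc/services'.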

Interacting with a network service

The next program makes use of the possibility to really interact with a network service by printing something into the special file. It asks the so called finger service if a user of the machine is logged in. When testing this program, you should also try to change `localhost' to some other machine name in your local network.

BEGIN {
  NetService = "/inet/tcp/0/localhost/finger"
  print "name" |& NetService
  while ((NetService |& getline) > 0)
    print $0
  close(NetService)
}

After telling the service on the machine which user it is looking for, the program repeatedly reads lines that come as a reply. When no more lines are coming (because the service has closed the connection), the program also closes by finishing. Try replacing name by your login name or the name of someone else logged in. If you want a list of all users currently logged in, replace name by an empty string (""). You could safely delete the final close() command from the above script because the operating system closes any open connection by default when a script reaches the end of execution. In order to avoid portability problems, we always close connections explicitly. With the Linux kernel, for example, proper closing results in flushing of buffers, while letting the close() happen by default will result in discarding buffers.

In the early days of the Internet (until around 1992), you could use such a program to check if some user in another country was logged in on a specific machine. RFC 1288 will give you the exact definition of the finger protocol. Every contemporary Unix system also has a command named finger, which functions as a client for the protocol of the same name. Still today, some people maintain simple information systems with this ancient protocol. For example, by typing

finger quake@seismo.unr.edu

[..]

DATE-(UTC)-TIME   LAT    LON      DEP   MAG    COMMENTS
yy/mm/dd hh:mm:ss   deg.   deg.    km

98/12/14 21:09:22  37.47N 116.30W   0.0 2.3Md  76.4 km   S of WARM SPRINGS, NEVA
98/12/14 22:05:09  39.69N 120.41W  11.9 2.1Md  53.8 km WNW of RENO, NEVADA      
98/12/15 14:14:19  38.04N 118.60W   2.0 2.3Md  51.0 km   S of HAWTHORNE, NEVADA 
98/12/17 01:49:02  36.06N 117.58W  13.9 3.0Md  74.9 km  SE of LONE PINE, CALIFOR
98/12/17 05:39:26  39.95N 120.87W   6.2 2.6Md 101.6 km WNW of RENO, NEVADA      
98/12/22 06:07:42  38.68N 119.82W   5.2 2.3Md  50.7 km   S of CARSON CITY, NEVAD

you get the latest Earthquake Bulletin for the state of Nevada. It contains time, location, depth, magnitude and a short comment about the earthquakes registered in that region during the last 10 days. In many places today the use of such services is restricted because most networks have firewalls and proxy servers between themselves and the Internet. Most firewalls are programmed not to let such a finger request go beyond the local network.

Another (ab)use of the finger protocol is the handful of Coke machines that are connected to the Internet. There is a short list of such Coke machines. You can access them either from the command line or with a simple GAWK script. They will usually tell you about the different flavours of Coke and Beer available there. If you have an account there, you can even order some drink this way.

When looking at `/etc/services' you may have noticed that the `daytime' service is also available with `udp'. In the example above change `tcp' to `udp' and `finger' to `daytime'. After starting the example, you will see the expected day and time message and then the program hangs because it waits for more lines coming from the service, but they never come. This behaviour is a consequence of the differences between TCP and UDP. When using UDP, neither party is automatically informed when the other closes the connection. If you keep experimenting this way, you will experience many other subtle differences between TCP and UDP. To avoid such trouble one should always remember the advice Comer & Stevens give in volume III of their series Internetworking With TCP/IP (page 14):

When designing client-server applications, beginners are strongly advised to use TCP because it provides reliable, connection-oriented communication. Programs only use UDP if the application protocol handles reliability, the application requires hardware broadcast or multicast, or the application cannot tolerate virtual circuit overhead.
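Returning to the UDP experiment described above, one way around the hanging loop is to read exactly one reply instead of reading until the connection closes. The following is only a sketch under assumptions: it presumes that a `daytime' service answers on UDP on your machine (on many current systems it is switched off), and it uses the READ_TIMEOUT element of PROCINFO, which exists only in GAWK 4.0 and later, to give up after two seconds. If nothing answers, the read may time out or fail instead of printing a date.

```awk
BEGIN {
    NetService = "/inet/udp/0/localhost/daytime"
    PROCINFO[NetService, "READ_TIMEOUT"] = 2000   # milliseconds; GAWK 4.0 or later
    print "" |& NetService                        # any datagram asks for a reply
    if ((NetService |& getline) > 0)              # read one reply only, no loop
        print $0
    else
        print "no reply from the daytime service"
    close(NetService)
}
```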

Setting up a service

The preceding programs behaved as clients which connected to a server somewhere on the net and requested a certain service. Now we will set up such a service ourselves which mimics the behaviour of the daytime service. Such a server does not know in advance who is going to connect to it over the network. Therefore we cannot insert a name for the host to connect to in our special file name. Start the following program in one window. Notice that our service does not have the name daytime but the number 8888. From looking at `/etc/services' you know that names like daytime are just mnemonics for some 16 bit integers. Only root could enter our new service into `/etc/services' with an appropriate name. Also notice that the service name has to be entered into a different field of the special file name because here we set up a server, not a client.

BEGIN {
  print strftime() |& "/inet/tcp/8888/0/0"
}

Now open another window (on the same machine) and start the client program that was given in the first example of this chapter. But before starting it, be sure to also change the name daytime to 8888. After starting the changed client program, you get a reply like this

Sat Sep 27 19:08:16 CEST 1997

and both programs have closed the connections now by terminating themselves.

Now we will intentionally make a mistake to see what happens when the name 8888 (the so called port) is already used by another service. Start the server program in both windows. The first one will do, but the second one will complain that it could not open the connection. Each port on a certain machine can only be used by one server program at a time. Now terminate the server program and change the name 8888 to echo. After restarting it, the server program will not run any more and you know why: an echo service is already running on your machine. But even if no echo service were running, on a Unix machine you could not start your own echo server, because the ports with numbers smaller than 1024 (echo is at port 7) are reserved for root. On machines running some flavour of Microsoft Windows, there is no restriction that reserves ports below 1024 for a privileged user, and hence you can start your own echo server there.

Turning this short server program into something really useful is simple. Imagine a server that first reads a file name from the client through the network connection, then does something with the file and finally sends the result back to the client. The server side processing could be

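BEGIN {
  NetService = "/inet/tcp/8888/0/0"
  NetService |& getline              # read the file name sent by the client
  CatPipe    = "cat " $1
  while ((CatPipe | getline) > 0)    # send the file back line by line
    print $0 |& NetService
  close(CatPipe)
  close(NetService)
}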

and we would have a remote copying facility. Such a server reads the name of a file from any client that connects to it and transmits the contents of the named file across the net. The server side processing could also be the execution of a command that was transmitted across the network. By this example you see how simple it is to open up a security hole on your machine. If you really allowed clients to connect to your machine and execute arbitrary commands, everyone would be free to do rm -rf *.
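One way to narrow this hole is to check the name received from the client before handing it to a shell command. The function below is a hypothetical helper, not part of GAWK: it accepts only plain file names made of letters, digits, underscore, dot and hyphen, and rejects names that start with a dot. This is a sketch of the idea rather than a complete defence.

```awk
# Return 1 if name is a plain file name without shell metacharacters,
# path separators or a leading dot; return 0 otherwise.
function IsSafeName(name) {
    return (name ~ /^[A-Za-z0-9_.-]+$/) && (name !~ /^\./)
}

BEGIN {
    Candidate[1] = "notes.txt"
    Candidate[2] = "../etc/passwd"
    Candidate[3] = "x; rm -rf *"
    for (i = 1; i <= 3; i++)
        print Candidate[i] ": " (IsSafeName(Candidate[i]) ? "accepted" : "rejected")
}
```

In the server above, the test would go right after the first getline, so that the cat pipe is only opened when IsSafeName($1) is true. Reading the file with getline < $1 instead of a cat pipe would avoid the shell altogether.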

Reading email

Have you ever wondered what your Netscape or Pine email client does when it retrieves your email from the email server? In this section we will see. The distribution of email is usually done by dedicated email servers that communicate with your machine using special protocols. To receive email, we will use the Post Office Protocol (POP) which is defined in RFC1939. Sending can be done with the much older Simple Mail Transfer Protocol (SMTP) which is defined in RFC821, see RFCs in HTML.

When you type in the following program, replace emailhost by the name of your local email server. Ask your administrator if the server has a POP service and then insert its name or number in the program below. Now the program is ready to connect to your email server but it will not succeed in retrieving your mail because it does not yet know your login name and your password. Replace name and password in the program and it will show you the first email the server has in store.

BEGIN {
  POPService  = "/inet/tcp/0/emailhost/pop3"
  RS = ORS = "\r\n"
  print "user name"            |& POPService
  POPService                   |& getline
  print "pass password"        |& POPService
  POPService                   |& getline
  print "retr 1"               |& POPService
  POPService                   |& getline
  if ($1 != "+OK") exit
  print "quit"                 |& POPService
  RS = "\r\n.\r\n"
  POPService                   |& getline
  print $0
  close(POPService)
}

The record separators RS and ORS are redefined because the protocol (POP) requires lines to be separated this way. After identifying yourself to the email service, the command retr 1 instructs the service to send the first of all your emails in line. If the service replies with something other than +OK, the program exits; maybe there is no email. Otherwise the program first announces that it intends to finish reading email and finally redefines RS in order to read the entire email as multiline input in one record. From RFC1939 we know that the body of the email always ends with a single line containing a single dot. You can invoke this program as often as you like; it will not delete the message it read, but leave it on the server.
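Once the whole email sits in one record, it can be taken apart further. An email consists of header lines, an empty line and the body, so a first step is to split the record at the first empty line. The function name SplitMessage and the array part are invented for this sketch, and the sample message is made up; in the program above one would pass $0 after the last getline.

```awk
# Split msg at the first empty line (two consecutive "\r\n").
# Stores the two pieces in part["header"] and part["body"].
# Returns 1 if an empty line was found, 0 otherwise.
function SplitMessage(msg, part,    pos) {
    pos = index(msg, "\r\n\r\n")
    if (pos == 0) {                      # no empty line: treat all as header
        part["header"] = msg
        part["body"]   = ""
        return 0
    }
    part["header"] = substr(msg, 1, pos - 1)
    part["body"]   = substr(msg, pos + 4)
    return 1
}

BEGIN {
    msg = "From: a@example.org\r\nSubject: test\r\n\r\nHello\r\nworld"
    SplitMessage(msg, part)
    print split(part["header"], line, "\r\n")   # number of header lines: 2
    print length(part["body"])                  # characters in the body: 12
}
```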

Reading a web page

Could it be that retrieving a web page from a web server is as simple as retrieving an email from an email server? Yes, it is. We only have to use a different but quite similar protocol and a different port. The name of the protocol is HyperText Transfer Protocol (HTTP) and the port number usually is 80. As in the preceding section, ask your administrator about the name of your local web server or proxy web server and its port number for HTTP requests. More detailed information about HTTP can be found at the home of the web protocols including the specification of HTTP in RFC2068. The only book about this protocol that I know can be found at http://www.browsebooks.com/Hethmon/?882.

The following program employs a rather crude approach toward retrieving a web page because it uses the prehistoric syntax of HTTP 0.9 which almost all web servers still support. The most noticeable thing about it is that the program directs the request to the local proxy server whose name you insert in the special file name (which in turn calls www.yahoo.com).

BEGIN {
  RS = ORS = "\r\n"
  HttpService = "/inet/tcp/0/proxy/80"
  print "GET http://www.yahoo.com"     |& HttpService
  while ((HttpService |& getline) > 0) print $0
}

Here, again, lines are separated by a redefined RS and ORS. The GET request that we send to the server is the only kind of HTTP request that existed when the web was created in the early 90s. HTTP calls this GET request a "method"; it tells the service to transmit a web page (here the home page of the Yahoo search engine). Version 1.0 (RFC1945) added the request methods HEAD and POST. The current version of HTTP is 1.1 and knows the additional request methods OPTIONS, PUT, DELETE, and TRACE. You can fill in any valid web address and the program will print the HTML code of that page onto your screen. Notice the similarity between the response of the POP service and the HTTP service. First you get a header that is terminated by an empty line and then you get the body of the page in HTML. The lines of the headers also have the same form as in POP. First there is the name of a parameter, then a colon and finally the value of that parameter.

You can also retrieve GIF files this way, but then you will get binary data that should be redirected into a file. Another application is calling a CGI script on some server. CGI scripts are used when the contents of a web page are not constant but generated at the moment you send a request for it. For example, to get a detailed report about the current quotes of the Motorola stock shares, call a CGI script at Yahoo with

print "GET http://quote.yahoo.com/q?s=MOT&d=t" |& HttpService

You could also call for weather reports this way. A good book to go on with is the HTML Source Book. There are also some books on CGI programming like the one by Thomas Boutell and this one. Another good source is The CGI Resource Index.
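The header lines mentioned earlier in this section (a parameter name, a colon, then the value) are easy to take apart in GAWK. Here is a minimal sketch; the function name ParseHeaderLine and the array Hdr are assumptions made for this example. Names are stored in lower case because header names are not case sensitive.

```awk
# Split a header line of the form "Name: value" and store the value
# in Hdr[name], with the name converted to lower case.
# Returns 1 if the line looks like a header line, 0 otherwise.
function ParseHeaderLine(line, Hdr,    pos, name, value) {
    pos = index(line, ":")
    if (pos < 2) return 0                # no colon, or nothing in front of it
    name  = tolower(substr(line, 1, pos - 1))
    value = substr(line, pos + 1)
    gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim blanks around the value
    Hdr[name] = value
    return 1
}

BEGIN {
    ParseHeaderLine("Content-Type: text/html", Hdr)
    print Hdr["content-type"]            # prints: text/html
}
```

In the client above, such a function could be called for each line read until the first empty line, which marks the end of the header.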

A primitive web service

Now we know enough about HTTP to set up a primitive web service that just says "Hello world" when someone connects to it with a Netscape browser. Compared to the situation in the preceding section, our program changes the role. It tries to behave just like the server we have observed. Since we are setting up a server here, we have to insert the port number in the localport field of the special file name. The other two fields (hostname and remoteport) have to contain a 0 because we do not know in advance which host will connect to our service.
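The difference between client and server names can be summarized in a few lines. The helper InetFile is hypothetical and exists only for this illustration; it simply joins the fields of a special file name in the order protocol, localport, hostname, remoteport.

```awk
# Build a special file name from its four fields.
function InetFile(protocol, localport, hostname, remoteport) {
    return "/inet/" protocol "/" localport "/" hostname "/" remoteport
}

BEGIN {
    # A client leaves localport at 0 and names the remote host and service.
    print InetFile("tcp", 0, "localhost", "daytime")  # /inet/tcp/0/localhost/daytime
    # A server names only its own port; the other two fields are 0.
    print InetFile("tcp", 8080, 0, 0)                 # /inet/tcp/8080/0/0
}
```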

In the early 90s all a server had to do was to send an HTML document and close the connection. Here we will adhere to the modern syntax of HTTP and first send a status line telling the Netscape browser that everything is OK. Then we send a line to tell the browser how many bytes follow in the body of the message. This was not necessary in the early 90s because both parties knew that the document ended when the connection closed. Nowadays it is possible to stay connected after the transmission of one web page. This avoids the network traffic necessary for repeatedly establishing TCP connections when requesting several images. It also makes it necessary to tell the receiving party how many bytes will be sent. The header is terminated as usual with an empty line. Finally we send the "Hello world" body in HTML. The useless while loop swallows the request of the browser. We could actually omit the loop and on most machines the program would still work. To check this out, first start the following program.

BEGIN {
  RS = ORS = "\r\n"
  HttpService = "/inet/tcp/8080/0/0"
  Hello = "<HTML><H1>Hello world</H1></HTML>"
  print "HTTP/1.0 200 OK"                                  |& HttpService
  print "Content-Length: " length(Hello) + length(ORS) ORS |& HttpService
  print Hello                                              |& HttpService
  while ((HttpService |& getline) > 0) ;
}

Now (on the same machine) start your favourite browser and let it point to http://localhost:8080. You see, the browser needs to know which port our server is listening at for requests. If this does not work, the browser probably tries to connect to a proxy server which does not know your machine. Then, change the browser's configuration so that the browser will not try to use a proxy to connect to your machine.

A web service with interaction

Setting up a web service that allows user interaction is more difficult and leads us to the limits of network access in GAWK. In this section, we will develop a main program (a BEGIN pattern and its action) that will become the core of event-driven execution controlled by a GUI. Each HTTP event that the user triggers by some action within the browser will be received in this central procedure. Parameters and menu choices are extracted from this request and an appropriate measure is taken according to the user's choice.

BEGIN {
  if (MyHost == "") { "uname -n"  |  getline MyHost; close("uname -n") }
  if (MyPort ==  0) MyPort = 8080
  HttpService = "/inet/tcp/" MyPort "/0/0"
  MyPrefix    = "http://" MyHost ":" MyPort
  SetUpServer()
  while ("GAWK" < "Perl") {
    RS = ORS    = "\r\n"          # header lines are terminated this way
    Status      = 200             # this means OK
    Reason      = "OK"
    Header      = TopHeader
    Document    = TopDoc
    Footer      = TopFooter
    if        (GETARG["Method"] == "GET")     { HandleGET()
    } else if (GETARG["Method"] == "HEAD")    { # not yet implemented
    } else if (GETARG["Method"]!=""){print "bad method",GETARG["Method"]
    }
    Prompt = Header Document Footer
    print "HTTP/1.0", Status, Reason                      |& HttpService
    print "Content-length:", length(Prompt) + length(ORS) |& HttpService
    print ORS Prompt                                      |& HttpService
    while ((HttpService |& getline) > 0) ; # ignore all the header lines
    close(HttpService)                     # stop talking to this client
    HttpService |& getline                 # wait for new client request
    print systime(), strftime(), $0        # do some logging
    delete GETARG;         delete MENUE;        delete PARAM
    GETARG["Method"]=$1;   GETARG["URI"]=$2;    GETARG["Version"]=$3
    for (i=length($2); (substr($2,i,1)!="?") && (i>0); i--) ;
    if (i > 0) {             # is there a "?" indicating a CGI request ?
      split(substr($2, 1, i-1), MENUE, "[/:]")
      split(substr($2, i+1), PARAM, "&")
      for (i in PARAM) {
        j = index(PARAM[i], "=")
        GETARG[substr(PARAM[i], 1, j-1)] = substr(PARAM[i], j+1)
      }
    } else {             # there is no "?", no need for splitting PARAMs
      split($2, MENUE, "[/:]")
    }
  }
}

This web server presents menu choices in the form of HTML links. Therefore, it has to tell the browser the name of the host it is residing on. When starting the server, the user may supply the name of the host from the command line with gawk -vMyHost="Rumpelstilzchen". Otherwise, the server looks up the name of the host it is running on for later use as a web address in HTML documents. The same applies to the port number. These values will later be inserted into the HTML content of the web pages to refer to the home system.

Each server that we will build around this core has to initialize some application dependent variables (like the default home page) in a procedure SetUpServer(), which will be called immediately before entering the infinite loop of the server. For now, we will write an instance that initiates a trivial interaction. With this home page, the client user can click on two possible choices and will get the current date either in human readable format or in seconds since 1970.

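function SetUpServer() {
  TopHeader = "<HTML><HEAD><title>My name is GAWK, GNU AWK</title></HEAD>"
  # The two links lead to the menu choices that HandleGET() evaluates.
  TopDoc    = "<BODY><h2>\
    Do you prefer your date <A HREF=\"" MyPrefix "/human\">human</A> or\
    <A HREF=\"" MyPrefix "/POSIX\">POSIXed</A>?</h2>" ORS ORS
  TopFooter = "</BODY></HTML>"
}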

On the first run through the main loop, the default line terminators are set and the default home page is copied to the actual home page. Since this is the first run, GETARG["Method"] is not initialized yet, hence the case selection over the method does nothing. Now that the home page is initialized, the server can start communicating to a client browser.

Having supplied the initial home page to the browser with a valid document stored in the parameter Prompt, the server closes the connection and waits for the next request. When the request comes, a log line is printed that allows us to see which request the server receives. Then, all variables for global storage of request parameters are cleared and filled with the newly extracted values. Next, the name of the requested resource is split into parts and stored for later evaluation. If the request contains a ?, then the request has CGI variables seamlessly appended to the web address. Everything in front of the ? is split up into menu items and everything behind the ? is a list of variable=value pairs (separated by &) that also need splitting. This way, CGI variables are isolated and stored. This procedure lacks recognition of special characters that are transmitted in coded form (as defined in RFC2068). Here, any optional request header and body parts are ignored. We do not need header parameters and the request body, but when refining our approach or working with the POST and PUT methods, reading the header and body will become inevitable. Header parameters should then be stored in a global variable, as should the body.
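The missing recognition of coded characters could be added with a small function. The following is a sketch, not part of the server above; the name UrlDecode is an assumption. It turns each + into a blank and each %XX sequence into the character with that hexadecimal code, and it relies on the GAWK function strtonum(), which is not available in other awk implementations.

```awk
# Decode a CGI value: "+" becomes a blank, "%XX" becomes the
# character with hexadecimal code XX.
function UrlDecode(s,    out, pos, hex) {
    gsub(/\+/, " ", s)                   # must happen before %2B is decoded
    out = ""
    while ((pos = match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) > 0) {
        hex = substr(s, pos + 1, 2)
        out = out substr(s, 1, pos - 1) sprintf("%c", strtonum("0x" hex))
        s   = substr(s, pos + 3)         # continue behind the decoded sequence
    }
    return out s
}

BEGIN {
    print UrlDecode("why+not%3F")        # prints: why not?
}
```

In the main loop, the value stored in GETARG could be passed through UrlDecode() right where the variable=value pairs are split.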

On each subsequent run through the main loop, one request from a browser is received, evaluated and answered according to the user's choice. This can be done by letting the value of the HTTP method guide the main loop into execution of the procedure HandleGET() which evaluates the user's choice. In this case, we have only one hierarchical level of menus, but menus are nested in the general case. The menu choices at each level are separated by / just like in file names. Notice how simple it is to construct menus of arbitrary depth.

function HandleGET() {
  if(        MENUE[2] == "human") {
    Footer  =  strftime() TopFooter
  } else if (MENUE[2] == "POSIX") {
    Footer  =  systime()  TopFooter
  }
}

The main disadvantage of this approach is that our server is slow and can handle only one request at a time. Its main advantage is that the server consists of just one GAWK program. No need for installing an `httpd', no need for static separate HTML files, CGI scripts or root privileges. This is rapid prototyping.

Start this program on the same host that runs your browser. Then let your browser point at http://localhost:8080.

It is also possible to include images into the HTML pages. Most browsers support the not very well known `.xbm' format, which may contain only monochrome pictures but is an ASCII format. Binary images are possible but not so easy to handle. Another way of including images is to generate them with a tool like GNUPLOT by calling the tool with the system() call or through a pipe.

A simple web server

In the preceding section, we built the core logic for event driven GUIs. In this section, we will finally extend the core to a real application. No one would actually write a commercial web server in GAWK but it is instructive to see that it is feasible in principle.

The application is ELIZA, the famous program by Joseph Weizenbaum that mimics the behaviour of a professional psychotherapist when talking to you. Weizenbaum would certainly object to this description, but this is part of the legend around ELIZA. Take the site independent core logic and append the following code. You will recognize that SetUpServer() is similar to the example above, except for another function SetUpEliza() being called. You can use this approach to implement other kinds of servers. The only changes needed to do so are hidden in the functions SetUpServer() and HandleGET(). Perhaps you also want to implement other HTTP-methods.

When extending this example to a complete application, the first thing to do is to implement the function SetUpServer() that initializes the HTML pages and some variables. These initializations determine the way your HTML pages will look (colours, titles, menu items etc.).

function SetUpServer() {
  SetUpEliza()
  TopHeader = "<HTML><title>An HTTP-based System with GAWK</title>\
               <HEAD><META HTTP-EQUIV=\"Content-Type\"\
               CONTENT=\"text/html; charset=iso-8859-1\"></HEAD>\
               <BODY BGCOLOR=\"#ffffff\" TEXT=\"#000000\" LINK=\"#0000ff\"\
                     VLINK=\"#0000ff\" ALINK=\"#0000ff\"> "
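  TopDoc    = "\
    <h2>Please choose one of the following actions:</h2>\
    <UL>\
      <LI><A HREF=\"" MyPrefix "/AboutServer\">About this server</A>\
      <LI><A HREF=\"" MyPrefix "/AboutELIZA\">About Eliza</A>\
      <LI><A HREF=\"" MyPrefix "/StartELIZA\">Start talking to Eliza</A>\
    </UL>"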
  TopFooter = "</BODY></HTML>"
}

The function HandleGET() decides which page the user wants to see next. It is a nested case selection. Each nesting level refers to a menu level of the GUI. Each case implements a certain action of the menu. On the deepest level of case selection the handler essentially knows what the user wants and stores the answer into a variable that holds the HTML page contents.

function HandleGET() {
  # A real HTTP-server would treat some parts of the URI as a file name.
  # We take parts of the URI as menu choices and go on accordingly.
  if(MENUE[2]   =="AboutServer") {
    Document    = "This is not a CGI script.\
      This is an httpd, an HTML-file and a CGI script all in one GAWK script.\
      It needs no separate www-server, no installation and no root privileges.\
      <br><br>To run it, do this:<br><ul>\
      <li> start this script with \"gawk -f httpserver.awk\",<br>\
      <li> and on the same host let your www browser open location\
           \"http://localhost:8080\"\
      </ul>\<br>\ Details of HTTP come from:<br><ul>\
            <li>Hethmon:  Illustrated Guide to HTTP<br>\
            <li>RFC 2068<br></ul><br>JK 14.9.97<br>"
  } else if (MENUE[2]=="AboutELIZA") {
    Document    = "This is an implementation of the famous ELIZA program\
                  by Joseph Weizenbaum. It is written in GAWK and uses\
                  an HTML GUI."
  } else if (MENUE[2]=="StartELIZA") {
    gsub(/\+/, " ", GETARG["YouSay"])
    # Here we also have to substitute coded special characters
    Document    = "<form method=GET><h3>" ElizaSays(GETARG["YouSay"]) "</h3>\
      <br><input type=text name=YouSay value=\"\" size=60>\
      <br><input type=submit value=\"Tell her about it\"> </form>"
  }
}
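
The comment in HandleGET() notes that coded special characters still have to be substituted. A minimal sketch of that step follows. The helper names UrlDecode() and HexValue() are hypothetical and not part of the original server; the sketch assumes that a "+" stands for a space and that "%XX" stands for the character with hexadecimal code XX, and it was only thought through for ASCII input.

```awk
# Hypothetical helper, not part of the original server: decode one
# URL-encoded form value ("+" means space, "%XX" is a character in hex).
function UrlDecode(s,    out, hex, pos) {
  gsub(/\+/, " ", s)            # must happen before %2B is turned into "+"
  out = ""
  # Find the next %XX escape, copy the text before it, append the character
  while ((pos = match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) > 0) {
    hex = substr(s, pos + 1, 2)
    out = out substr(s, 1, pos - 1) sprintf("%c", HexValue(hex))
    s = substr(s, pos + 3)
  }
  return out s
}

# Convert a string of hex digits to a number without GAWK extensions
function HexValue(h,    i, n) {
  h = toupper(h)
  n = 0
  for (i = 1; i <= length(h); i++)
    n = n * 16 + index("0123456789ABCDEF", substr(h, i, 1)) - 1
  return n
}

BEGIN {
  print UrlDecode("I+feel+sad%2C+really%21")   # prints: I feel sad, really!
}
```

In the server, such a function could be applied to GETARG["YouSay"] before the text is handed to ElizaSays().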

Now we are down at the heart of ELIZA. Here you can see how it works. Initially the user does not say anything; then ELIZA resets its money counter and just asks the user to say openly what comes to mind. The subsequent answers are first converted to upper case and stored for later comparison. ELIZA will present the bill when confronted with a sentence that contains the phrase "shut up". Otherwise it looks for keywords in the sentence, conjugates the rest of the sentence, remembers the keyword for later use and finally selects an answer from the set of possible answers.

function ElizaSays(YouSay) {
  if (YouSay == "") {
    cost=0
    answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
  } else {
    q = toupper(YouSay)
    gsub("'", "", q)
    if(q == qold) {
      answer = "PLEASE DONT REPEAT YOURSELF !"
    } else {
      if(index(q, "SHUT UP")>0) {
        answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
                 int(100*rand()+30+cost/100)
      } else {
        qold = q
        w="-"                   # no keyword recognized yet
        for (i in k) {          # search for keywords
          if(index(q, i) > 0) {
            w=i;
            break
          }
        }
        if (w == "-") {         # no keyword, take old subject
          w    = wold;
          subj = subjold
        } else {                # find subject 
          subj = substr(q, index(q, w) + length(w)+1)
          wold = w;
          subjold = subj        #  remember keyword and subject
        }
        for (i in conj) gsub(i, conj[i], q)   # conjugation
        # from all answers to this keyword, select one randomly
        answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
        # insert subject into answer
        gsub("_", subj, answer)
      }
    }
  }
  cost += length(answer) # for later payment : 1 cent per character
  return answer
}

In the long but simple function SetUpEliza() you can see tables for conjugation, keywords and answers. The associative array k[] contains indices into the array of answers r[]. To choose an answer, ELIZA just picks an index randomly.

function SetUpEliza() {
  srand()
  wold = "-"
  subjold = " "

  # table for conjugation
  conj[" ARE "     ] = " AM "
  conj["WERE "     ] = "WAS "
  conj[" YOU "     ] = " I "
  conj["YOUR "     ] = "MY "
  conj[" IVE "     ] =\
  conj[" I HAVE "  ] = " YOU HAVE "
  conj[" YOUVE "   ] =\
  conj[" YOU HAVE "] = " I HAVE "
  conj[" IM "      ] =\
  conj[" I AM "    ] = " YOU ARE "
  conj[" YOURE "   ] =\
  conj[" YOU ARE " ] = " I AM "

  # table of all answers
  r[1]   = "DONT YOU BELIEVE THAT I CAN  _"
  r[2]   = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
  r[3]   = "YOU WANT ME TO BE ABLE TO _ ?"
  r[4]   = "PERHAPS YOU DONT WANT TO _ "
  r[5]   = "DO YOU WANT TO BE ABLE TO _ ?"
  r[6]   = "WHAT MAKES YOU THINK I AM _ ?"
  r[7]   = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
  r[8]   = "PERHAPS YOU WOULD LIKE TO BE _ ?"
  r[9]   = "DO YOU SOMETIMES WISH YOU WERE _ ?"
  r[10]  = "DONT YOU REALLY _ ?"
  r[11]  = "WHY DONT YOU _ ?"
  r[12]  = "DO YOU WISH TO BE ABLE TO _ ?"
  r[13]  = "DOES THAT TROUBLE YOU ?"
  r[14]  = "TELL ME MORE ABOUT SUCH FEELINGS"
  r[15]  = "DO YOU OFTEN FEEL _ ?"
  r[16]  = "DO YOU ENJOY FEELING _ ?"
  r[17]  = "DO YOU REALLY BELIEVE I DONT _ ?"
  r[18]  = "PERHAPS IN GOOD TIME I WILL _ "
  r[19]  = "DO YOU WANT ME TO _ ?"
  r[20]  = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
  r[21]  = "WHY CANT YOU _ ?"
  r[22]  = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
  r[23]  = "WOULD YOU PREFER IF I WERE NOT _ ?"
  r[24]  = "PERHAPS IN YOUR FANTASIES I AM _ "
  r[25]  = "HOW DO YOU KNOW YOU CANT _ ?"
  r[26]  = "HAVE YOU TRIED ?"
  r[27]  = "PERHAPS YOU CAN NOW _ "
  r[28]  = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
  r[29]  = "HOW LONG HAVE YOU BEEN _ ?"
  r[30]  = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
  r[31]  = "DO YOU ENJOY BEING _ ?"
  r[32]  = "WE WERE DISCUSSING YOU -- NOT ME"
  r[33]  = "Oh, I _"
  r[34]  = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
  r[35]  = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
  r[36]  = "WHY DO YOU WANT _ ?"
  r[37]  = "SUPPOSE YOU SOON GOT _"
  r[38]  = "WHAT IF YOU NEVER GOT _ ?"
  r[39]  = "I SOMETIMES ALSO WANT _"
  r[40]  = "WHY DO YOU ASK ?"
  r[41]  = "DOES THAT QUESTION INTEREST YOU ?"
  r[42]  = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
  r[43]  = "WHAT DO YOU THINK ?"
  r[44]  = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
  r[45]  = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
  r[46]  = "HAVE YOU ASKED ANYONE ELSE ?"
  r[47]  = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
  r[48]  = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
  r[49]  = "NAMES DON'T INTEREST ME"
  r[50]  = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
  r[51]  = "IS THAT THE REAL REASON ?"
  r[52]  = "DONT ANY OTHER REASONS COME TO MIND ?"
  r[53]  = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
  r[54]  = "WHAT OTHER REASONS MIGHT THERE BE ?"
  r[55]  = "PLEASE DON'T APOLOGIZE !"
  r[56]  = "APOLOGIES ARE NOT NECESSARY"
  r[57]  = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
  r[58]  = "DON'T BE SO DEFENSIVE"
  r[59]  = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
  r[60]  = "DO YOU DREAM OFTEN ?"
  r[61]  = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
  r[62]  = "ARE YOU DISTURBED BY YOUR DREAMS ?"
  r[63]  = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
  r[64]  = "YOU DON'T SEEM QUITE CERTAIN"
  r[65]  = "WHY THE UNCERTAIN TONE ?"
  r[66]  = "CAN'T YOU BE MORE POSITIVE ?"
  r[67]  = "YOU AREN'T SURE ?"
  r[68]  = "DON'T YOU KNOW ?"
  r[69]  = "WHY NO _ ?"
  r[70]  = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
  r[71]  = "WHY NOT ?"
  r[72]  = "ARE YOU SURE ?"
  r[73]  = "WHY NO ?"
  r[74]  = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
  r[75]  = "WHAT ABOUT YOUR OWN _ ?"
  r[76]  = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
  r[77]  = "WHEN ?"
  r[78]  = "WHAT ARE YOU THINKING OF ?"
  r[79]  = "REALLY, ALWAYS ?"
  r[80]  = "DO YOU REALLY THINK SO ?"
  r[81]  = "BUT YOU ARE NOT SURE YOU _ "
  r[82]  = "DO YOU DOUBT YOU _ ?"
  r[83]  = "IN WHAT WAY ?"
  r[84]  = "WHAT RESEMBLANCE DO YOU SEE ?"
  r[85]  = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
  r[86]  = "WHAT OTHER CONNECTION DO YOU SEE ?"
  r[87]  = "COULD THERE REALLY BE SOME CONNECTIONS ?"
  r[88]  = "HOW ?"
  r[89]  = "YOU SEEM QUITE POSITIVE"
  r[90]  = "ARE YOU SURE ?"
  r[91]  = "I SEE"
  r[92]  = "I UNDERSTAND"
  r[93]  = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
  r[94]  = "DO YOUR FRIENDS WORRY YOU ?"
  r[95]  = "DO YOUR FRIENDS PICK ON YOU ?"
  r[96]  = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
  r[97]  = "DO YOU IMPOSE ON YOUR FRIENDS ?"
  r[98]  = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
  r[99]  = "DO COMPUTERS WORRY YOU ?"
  r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
  r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
  r[102] = "WHY DO YOU MENTION COMPUTERS ?"
  r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
  r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
  r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
  r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
  r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
  r[108] = "I SEE"
  r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
  r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
  r[111] = "CAN YOU ELABORATE ON THAT ?"
  r[112] = "THAT IS QUITE INTERESTING"
  r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
  r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
  r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
  r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
  r[117] = "WHAT'S ABOUT ME ?"
  r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"

  # table for looking up answers that fit to a certain keyword 
  k["CAN YOU"]      = "1 2 3"
  k["CAN I"]        = "4 5"
  k["YOU ARE"]      =\
  k["YOURE"]        = "6 7 8 9"
  k["I DONT"]       = "10 11 12 13"
  k["I FEEL"]       = "14 15 16"
  k["WHY DONT YOU"] = "17 18 19"
  k["WHY CANT I"]   = "20 21"
  k["ARE YOU"]      = "22 23 24"
  k["I CANT"]       = "25 26 27"
  k["I AM"]         =\
  k["IM "]          = "28 29 30 31"
  k["YOU "]         = "32 33 34"
  k["I WANT"]       = "35 36 37 38 39"
  k["WHAT"]         =\
  k["HOW"]          =\
  k["WHO"]          =\
  k["WHERE"]        =\
  k["WHEN"]         =\
  k["WHY"]          = "40 41 42 43 44 45 46 47 48"
  k["NAME"]         = "49 50"
  k["CAUSE"]        = "51 52 53 54"
  k["SORRY"]        = "55 56 57 58"
  k["DREAM"]        = "59 60 61 62"
  k["HELLO"]        =\
  k["HI "]          = "63"
  k["MAYBE"]        = "64 65 66 67 68"
  k[" NO "]         = "69 70 71 72 73"
  k["YOUR"]         = "74 75"
  k["ALWAYS"]       = "76 77 78 79"
  k["THINK"]        = "80 81 82"
  k["LIKE"]         = "83 84 85 86 87 88 89"
  k["YES"]          = "90 91 92"
  k["FRIEND"]       = "93 94 95 96 97 98"
  k["COMPUTER"]     = "99 100 101 102 103 104 105"
  k["-"]            = "106 107 108 109 110 111 112"
  k["MONEY"]        = "113 114 115"
  k["ELIZA"]        = "116 117 118"
}
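
The way k[] and r[] work together can be tried in isolation. The following sketch uses a tiny stand-in table with three of the answers from above; the function name PickAnswer() is hypothetical and only unpacks the single selection line of ElizaSays() into separate steps.

```awk
# Sketch with a tiny stand-in table; the real tables come from SetUpEliza()
function PickAnswer(keyword,    n, indices) {
  # split() stores the answer numbers in indices[1..n] and returns n
  n = split(k[keyword], indices)
  # int(n * rand()) lies in 0..n-1, so adding 1 gives a valid position
  return r[indices[int(n * rand()) + 1]]
}

BEGIN {
  srand()
  r[14] = "TELL ME MORE ABOUT SUCH FEELINGS"
  r[15] = "DO YOU OFTEN FEEL _ ?"
  r[16] = "DO YOU ENJOY FEELING _ ?"
  k["I FEEL"] = "14 15 16"
  answer = PickAnswer("I FEEL")
  gsub("_", "TIRED", answer)    # insert the subject, as ElizaSays() does
  print answer
}
```

Because the choice is random, each run may print a different one of the three answers.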

Some interesting remarks and details (including the original source code of ELIZA) can be found on Mark Humphrys' home page. Yahoo also has a page with a collection of ELIZA-like programs. Many of them are written in Java, and some even supply source code.

Where to go from here

Now you have learned enough to build your own application. You could, for example, take part in the Loebner Contest to win $100,000. The Loebner Prize is the first formal instantiation of a Turing Test. Hugh Loebner agreed with The Cambridge Center for Behavioral Studies to underwrite a contest designed to implement the Turing Test. Dr. Loebner pledged a Grand Prize of $100,000 for the first computer whose responses were indistinguishable from a human's. Each year an annual prize of $2000 and a bronze medal are awarded to the most human computer. The winner of the annual contest is the best entry relative to other entries that year, irrespective of how good it is in an absolute sense. Here is an example of a conversation with the winning program of 1997:

(c)1997 Cambridge Center For Behavioral Studies all rights reserved [CONVERSE] 
[Intelligent Research Ltd.] Start at: [1997/04/29 08:59:40]
PROGRAM[10:39:20]Hi, my name is Catherine, what's yours?
JUDGE02[10:39:36]My name is Nathan.
PROGRAM[10:39:37]Well, it's nice to meet you, Nathan.
PROGRAM[10:39:45]Did you see that story on CNN last night about the lesbian
PROGRAM[10:40:03]couple who came out at a White House party on Sunday?
JUDGE02[10:40:45]No.  I'm sorry I missed that.  How did the Prez take it?
PROGRAM[10:40:47]Ellen Degeneres was one of them - she was kissing her lover
PROGRAM[10:41:08]in front of Bill Clinton. Do you think he was very
PROGRAM[10:41:21]embarrassed?
JUDGE02[10:41:37]You tell me.  I didn't see it.
PROGRAM[10:41:38]Are you still a Friend of Bill Clinton?
JUDGE02[10:42:22]Yes.  I would like to talk about something else.  Do you know anything about sculpture.
PROGRAM[10:42:24]Well, I suppose he's doing his best like most of us, but he
PROGRAM[10:42:35]does seem to attract an UNUSUAL amount of bad publicity,
PROGRAM[10:42:51]wouldnt you say--can it all be bad luck?
JUDGE02[10:43:20]I asked you if you knew anything about sculpture.
PROGRAM[10:43:21]OK, but what happens now--does it all drift on, with
PROGRAM[10:43:36]everyone waiting to see how much his friends have said about
PROGRAM[10:43:47]him under threat of lots and lots of jail-or will something
PROGRAM[10:44:01]dramatic actually happen?
JUDGE02[10:44:19]Do you know what sculpture is?
PROGRAM[10:44:20]Well, youre more a fan of his than I am I guess--though I
PROGRAM[10:44:34]cant SWEAR what Id have done in that locked hotel room, I
PROGRAM[10:44:51]must say.
JUDGE02[10:45:26]Have you ever been to the Salmagundi Club?
PROGRAM[10:45:29]Never. to go back to the original discussion Time to get
PROGRAM[10:45:43]off this one I think and onto something more down to earth!
JUDGE02[10:46:20]What is more down to earth than art?
PROGRAM[10:46:21]I don't know what -- to be honest. -
PROGRAM[10:46:30]I suppose it depends on where you're coming from, but as the
PROGRAM[10:46:46]song goes in My Fair Lady, "Why can't a woman be more like a
PROGRAM[10:47:03]man? "
JUDGE02[10:47:52]I liked that song but I prefer "Why can't children be more like we are"
JUDGE02[10:48:10]from another play the name of which I cannot remember.
PROGRAM[10:48:12]What, if anything, do you enjoy about being a man?
JUDGE02[10:49:01]I would enjoy all aspects of life no matter what my gender.
PROGRAM[10:49:02]Do you think of yourself as being attractive?

This program insists on always returning to the same story about Bill Clinton. You see, even a program with a rather narrow mind can behave so much like a human being that it can win this prize. It is quite common to let these programs talk to each other via network connections. But during the competition itself, the program and its computer have to be present at the place where the competition is held. We all would love to see a GAWK program win in such an event. Maybe it is up to you to accomplish this?

Some other ideas for useful networked applications:

  1. Read the file `doc/awkforai.txt' in the GAWK distribution. It was written by Ronald P. Loui (loui@ai.wustl.edu, Associate Professor of Computer Science, at Washington University in St. Louis) and summarizes why he teaches GAWK to students of Artificial Intelligence. Here are some passages from the text:

    The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as spoonsful of syntactic sugar. [..] There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI testbed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is "turn left; turn right." If the robot is Netscape, then the right language is something that can generate "netscape -remote 'openURL(http://cs.wustl.edu/~loui)'" with elan. [..] AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor. [..] Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic." 
    The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.

    Now that GAWK itself can connect to the Internet, it should be obvious that it is suitable to the problem of writing intelligent web agents.
  2. awk is strong at pattern recognition and string processing. So, it is well suited to the classic problem of language translation. A first try could be a program that knows the 100 most frequent English words and their counterparts in German or French. The service could be implemented by regularly reading email with the program above, replacing each word by its translation and sending the translation back via SMTP. Users would send an English email to their translation service and get back a translated email in return. As soon as this works, more effort can be spent on a real translation program.
  3. Another one of these dialogue oriented applications on the verge of ridicule is the email support service. Troubled customers write an email to an automatic GAWK service that reads the email. It looks for keywords in the mail and assembles a reply email accordingly. By also examining the email header carefully and repeating these keywords throughout the reply, it is rather simple to give the customer a feeling that someone cares. Ideally, such a service would search a database of previous cases for solutions. If no such database exists, it could, for example, consist of all the newsgroups, mailing lists and FAQs on the Internet.
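
The word-by-word replacement at the core of the translation idea can be sketched without any networking. The four dictionary entries and the function names SetUpDictionary() and TranslateLine() are hypothetical; the sketch ignores punctuation, grammar and capitalization rules, and it leaves out the reading and sending of email.

```awk
# Hypothetical word table; a real service would load a much larger one
function SetUpDictionary() {
  dict["and"]   = "und"
  dict["good"]  = "gut"
  dict["house"] = "Haus"
  dict["is"]    = "ist"
}

# Replace each known word of one line by its translation; keep the rest
function TranslateLine(line,    n, i, words, out) {
  n = split(line, words, " ")
  out = ""
  for (i = 1; i <= n; i++) {
    if ((tolower(words[i])) in dict)
      words[i] = dict[tolower(words[i])]
    out = out (i > 1 ? " " : "") words[i]
  }
  return out
}

BEGIN {
  SetUpDictionary()
  print TranslateLine("My house is old and good")
}
```

Words that are missing from the table pass through unchanged, which makes gaps in the dictionary easy to spot in the output.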

By now you should also have noticed that debugging such a networked application is more complicated than debugging a single-process/single-hosted application. The behaviour of a networked application sometimes looks non-causal because it is not reproducible in a strong sense. Whether your network application works or not sometimes depends on

  1. how crowded the underlying network is
  2. if the party at the other end is running or not
  3. the state of the party at the other end

The most difficult problems for a beginner arise from hidden states of the underlying network. After closing a TCP connection, you often have to wait a short while before reopening the connection. Even more difficult is the establishment of a connection that formerly ended with a "broken pipe". Those connections have to "time out" for a minute before you can reopen the connection. You can check this with the command netstat -a which gives you a list of still "active" connections.

Network Access - The details

Data transmission between two parties over the network occurs synchronously like in the rendez-vous concept. The party that comes first in a request for transmission (read or write) will wait for the other to become ready for transmission. But before going into the details, we must clarify the meaning of the terms client and server. The terminology of this chapter will look bizarre and incomprehensible to those who have never heard network jargon before. In order to make clear what happens, some terminology is unavoidable. If TCP were absent from this description, things would be much simpler. The main reason for the subtle distinctions we have to make is that the system calls connect(), listen() and accept() (used with TCP) have to be hidden from the user. UDP and RAW are simpler during connection buildup but TCP is simpler afterwards.

After establishment of a connection, the application protocol (HTTP, POP3 or FTP) determines who is client, who is server and what each one is supposed to do. We will call this the application level client or server. Here, we assume nothing about the intentions of programs accessing the special files. Furthermore, we assume that a connection has not yet been established. In the context of building up a connection, we use the term connection level client and server to denote the behaviour of a program during establishment of the connection. In this context we also have to distinguish between connection level client and server behaviour for each of the protocols TCP, UDP and RAW. In general, whoever comes first will be called the connection level server in this chapter. The other end will be called the connection level client. With TCP, the connection level server will block, whether it is reading or writing. UDP and RAW know only blocking reads. A connection level client will only block for reading with all protocols.

Establishing a connection will always put both ends into connected state in the sense of the socket API. This is true for TCP and UDP connections and implies that we always have point-to-point connections that enable exactly one party on one end to talk to exactly one party at the other end. Once established, the connection between both will keep away other parties demanding connection. Only after closing a connection can a new one be built up. This is contrary to the usual behaviour of fully developed web servers which have to avoid situations in which they are not reachable. We have to pay this price in order to enjoy the benefits of a simple communication paradigm.

The special file name for network access is made up of several fields, all of them mandatory:

/inet/protocol/localport/hostname/remoteport

The inet field is of course constant when accessing the network. The localport and remoteport fields do not have a meaning when used with `/inet/raw' because "port" is a term only known to TCP and UDP. So, when using `/inet/raw' the port fields always have to be 0.

The fields of the special file name

Here, we will explain the meaning, the range of values and the defaults for all other fields.

  1. The mandatory protocol field determines which member of the TCP/IP family of protocols will be selected to transport the data across the network. There are three possible values (always written lowercase): tcp, udp and raw. The exact meaning of each is explained below. There is no default to this field.
  2. The mandatory field localport determines which port on the local machine will be used to communicate across the network. It has no meaning with RAW and must therefore be 0. Application level clients usually use 0 to indicate they do not care which local port is used. Instead they specify a remote port to connect to. It is vital for application level servers to use a number different from 0 here because their service has to be available at a specific publicly known port number. It is possible to use a name from /etc/services here. No default value is assumed. The value 0 means "any".
  3. The mandatory field hostname determines which remote host has to be at the other end of the connection. Application level servers must fill this field with a 0 to indicate that they are open for all other hosts to connect to; this enforces connection level server behaviour. It is not possible for an application level server to restrict its availability to one remote host by entering a host name here. Application level clients must enter a name different from 0 here. The name can be either symbolic (jpl-devvax.jpl.nasa.gov) or numeric (128.149.1.143). No default value is assumed. The value 0 means "any" and enforces server behaviour.
  4. The mandatory field remoteport determines which port on the remote machine will be used to communicate across the network. It has no meaning with RAW and must therefore be 0. Application level clients must use a number different from 0 here to indicate which port on the remote machine they want to connect to. Application level servers must fill this field with a 0. Instead they specify a local port for clients to connect to. It is possible to use a name from /etc/services here. No default value is assumed. The value 0 means "any" and enforces server behaviour.
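
To see how the four fields combine, the special file name can be assembled from its parts. The helper InetName() below is hypothetical and only concatenates strings; it opens no connection, so it can be tried on any machine.

```awk
# Hypothetical helper: assemble the special file name from its four fields
function InetName(protocol, localport, hostname, remoteport) {
  return "/inet/" protocol "/" localport "/" hostname "/" remoteport
}

BEGIN {
  # Application level server: fixed local port, any host, any remote port
  print InetName("tcp", 8080, 0, 0)
  # Application level client: any local port, named host and remote port
  print InetName("tcp", 0, "localhost", 8080)
}
```

The first name matches the pattern used by the server examples, the second the pattern used by the client examples.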

Experts in network programming will notice that the usual client-server-asymmetry found at the level of the socket API is not visible here. This is for the sake of simplicity of the high level concept. If you really miss this, use another language like C or Perl. For GAWK it is more important to enable users to write a client program with five lines of code. What happens when a network connection is first accessed can be seen in the following pseudo code:

if ((name of remote host given) && (other side accepts connection)) {
  rendez-vous successful; transmit with getline or print
} else {
  if ((other side did not accept) && (localport == 0))
    exit unsuccessful
  if (TCP) {
    set up a server accepting connections
    this means waiting for the client on the other side to connect
  } else {
    ready
  }
}

The exact behaviour of this algorithm depends on the values of the fields of the special file name. When in doubt, use the following table that gives you the combinations of values and their meaning. If you think this table is too complicated, restrict yourself to the three lines printed in bold letters. All examples of the preceding chapter used only the patterns printed in bold letters.

PROTOCOL       LOCAL PORT  HOST NAME  REMOTE PORT  RESULTING CONNECTION LEVEL BEHAVIOUR
tcp            0           x          x            dedicated client, fails if immediately connecting to a server on the other side fails
udp            0           x          x            dedicated client
raw            0           x          0            dedicated client, works only as root
tcp, udp       x           x          x            client, switches to dedicated server if necessary
tcp, udp       x           0          0            dedicated server
raw            0           0          0            dedicated server, works only as root
tcp, udp, raw  x           x          0            invalid
tcp, udp, raw  0           0          x            invalid
tcp, udp, raw  x           0          x            invalid
tcp, udp       0           0          0            invalid
tcp, udp       0           x          0            invalid
raw            x           0          0            invalid
raw            0           x          x            invalid
raw            x           x          x            invalid

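
The table can also be written down as a small lookup function, which is sometimes easier to consult than the table itself. This is only a sketch: the function name Behaviour() is hypothetical, "x" is taken to mean "any value other than 0" (an assumption about the table's notation), and the function describes the table rather than GAWK's actual implementation.

```awk
# Sketch of the table as a lookup function. "x" in the table is taken
# here to mean "any value other than 0" (an assumption of this sketch).
function Behaviour(protocol, localport, hostname, remoteport,    key) {
  key = (localport == 0) ? "0" : "x"
  key = key ((hostname == 0) ? "0" : "x")
  key = key ((remoteport == 0) ? "0" : "x")
  if (protocol == "raw") {
    if (key == "0x0") return "dedicated client, works only as root"
    if (key == "000") return "dedicated server, works only as root"
    return "invalid"
  }
  if (protocol == "tcp" || protocol == "udp") {
    # with tcp, a dedicated client also fails if connecting fails
    if (key == "0xx") return "dedicated client"
    if (key == "xxx") return "client, switches to dedicated server if necessary"
    if (key == "x00") return "dedicated server"
  }
  return "invalid"
}

BEGIN {
  print Behaviour("tcp", 8888, 0, 0)
  print Behaviour("tcp", 0, "localhost", 8888)
  print Behaviour("udp", 0, 0, 8888)
}
```

The three calls cover a server pattern, a client pattern and one of the invalid combinations.
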
Now we will develop a pair of programs (sender and receiver) that do nothing but send a time stamp from one machine to another. We will implement the sender and the receiver with each of the three protocols available and discover the differences between them.

`/inet/tcp'

Once again, you should always use TCP. There are few circumstances that justify the use of UDP or RAW. We can take an earlier example as the sender program:

BEGIN {
  print strftime() |& "/inet/tcp/8888/0/0"
}

The receiver is almost identical to the first example of this chapter:

BEGIN {
  "/inet/tcp/0/localhost/8888" |& getline
  print $0
}

TCP can guarantee that the bytes at the receiving end arrive in exactly the same order they were sent at the sending end. No byte will be lost (except for broken connections), no byte doubled, no byte out of order. Some overhead is necessary to accomplish this but this is the price we pay for a reliable service.

It does matter which side starts first. The sender/server has to be started first and will wait for the receiver to read a line.

`/inet/udp'

Both programs are almost identical to their TCP counterparts. Only the name of the `protocol' has changed. As before, it does matter which side starts first. The receiving side will block and wait for the sender. So, in this case, the receiver/client has to be started first.

BEGIN {
  print strftime() |& "/inet/udp/8888/0/0"
}

The receiver likewise differs from its TCP counterpart only in the protocol field:

BEGIN {
  "/inet/udp/0/localhost/8888" |& getline
  print $0
}

UDP cannot guarantee that the datagrams at the receiving end arrive in exactly the same order they were sent at the sending end. Some datagrams could be lost, some doubled, and some out of order. In return, UDP avoids the overhead that TCP needs for its guarantees. This unreliable behaviour is good enough for tasks like data acquisition, logging and even stateless services like NFS.

`/inet/raw'

This is an IP level protocol. Only root is allowed to access this special file. It is meant to be the basis for implementing and experimenting with transport level protocols. In the most general case, the sender has to supply the encapsulating header bytes in front of the packet and the receiver has to strip the additional bytes from the message.

RAW-receivers cannot receive packets sent with TCP or UDP because the operating system will not deliver the packets to a RAW-receiver. The operating system knows some protocols on top of IP and will decide on its own which packet to deliver to which process (see Richard Stevens' home page and books). Therefore we have to use the UDP-receiver for receiving UDP datagrams sent with the RAW-sender. This is a dark corner - not only of GAWK but also of TCP/IP implementations. Those few interested in playing with protocols will benefit from the approach implemented in a tool called SPAK. This tool reflects the hierarchical layering of protocols (encapsulation) in the way data streams are piped out of one program into the next one. You can see which protocol is based on which other (lower level) protocol by looking at the command line ordering of the program calls. Cleverly thought out, SPAK will serve you much better than GAWK's `/inet' if you want to learn the meaning of each and every bit in the protocol headers.

We will use the RAW protocol to emulate the behaviour of UDP. The sender program is the same as above but with some additional bytes that fill the places of the UDP fields.

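BEGIN {
  Message = "Hello world\n"
  SourcePort = 0
  DestinationPort = 8888
  # UDP length field: 8 header bytes plus the payload
  MessageLength = length(Message)+8
  RawService = "/inet/raw/0/localhost/0"
  # Emulated UDP header: source port, destination port, length and
  # checksum (left at 0), each as two bytes with the high byte first
  printf("%c%c%c%c%c%c%c%c%s", int(SourcePort/256), SourcePort%256,
                               int(DestinationPort/256), DestinationPort%256,
                               int(MessageLength/256), MessageLength%256,
                               0, 0, Message) |& RawService
  fflush(RawService)
}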
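
The sender stores each 16-bit header field as two bytes. The arithmetic behind this can be checked without root privileges and without any network access. The helper names HighByte() and LowByte() in this sketch are hypothetical.

```awk
# Split a 16-bit value into its high and low byte, as the RAW sender does
function HighByte(n) { return int(n / 256) }
function LowByte(n)  { return n % 256 }

BEGIN {
  DestinationPort = 8888
  hi = HighByte(DestinationPort)
  lo = LowByte(DestinationPort)
  # Recombining both bytes must give back the original value
  printf("%d -> %d %d -> %d\n", DestinationPort, hi, lo, hi * 256 + lo)
}
```

For port 8888 this prints "8888 -> 34 184 -> 8888". The message "Hello world\n" has 12 characters, so the length field of the sender above is 20, stored as the bytes 0 and 20.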

Since we try to emulate the behaviour of UDP, we will check if the RAW-sender is understood by the UDP-receiver but not if the RAW-receiver can understand the UDP-sender (see above). In a real network, the RAW-receiver will hardly be of any use because it gets every IP packet that comes across the network. There will usually be so many packets that GAWK is too slow for processing them. Only on a network with little traffic can you test the IP-level receiver program. Programs for analyzing IP traffic on modem- or ISDN-channels should be possible.

Port numbers do not have a meaning when using /inet/raw. Their fields have to be 0 or empty. Only TCP and UDP know them. Receiving data from /inet/raw is difficult not only because of processing speed but also because data will usually be binary and not restricted to ASCII. This implies that line separation with RS will not work as usual.

Some Applications and Techniques

In this chapter, we will have a look at some self contained scripts that meet at least one of these criteria:

  1. Building blocks that encapsulate often needed functions of the networking world.
  2. New techniques that broaden the scope of problems that can be solved with GAWK.
  3. Exploring leading edge technology that may shape the future of networking.

Here, the emphasis is on concise networking. The applications mentioned near the end of the first chapter do not meet this requirement because they will result in long programs that need careful examination of many special cases and in-depth knowledge of vast areas of well established fields other than networking.

We will often refer to the site independent core of the server that we built in the first chapter. This means the BEGIN part of the ELIZA program. When building new and non-trivial servers, we will always copy this building block and append new instances of the two functions SetUpServer() and HandleGET().

Does it really make sense to employ this same scheme again and again with varying content? Yes, because this scheme of event-driven execution provides GAWK with an interface to the most widely accepted standard for GUIs: the web browser. Now, GAWK can even rival Tcl/Tk. Tcl and GAWK have much in common. Both are simple scripting languages that allow us to quickly solve problems with short programs. But Tcl has Tk on top of it and GAWK had nothing comparable up to now. While Tcl needs a large and ever changing library (Tk, which was bound to the X-Windows environment until recently), GAWK needs just the networking interface and some kind of browser on the client's side. Besides better portability, the most important advantage of this approach (embracing well established standards like HTTP and HTML) is that we do not need to change the language. We let others do the work of fighting over protocols and standards. We use HTML, JavaScript, VRML or whatever comes along to do our work.

PANIC - an emergency web server

Did you think the "Hello world" example in the first chapter was useless? By adding just a few lines, we can turn it into something useful. The PANIC program below tells everyone who connects that the local site is not working. When a web server breaks down, it makes a difference whether customers get a strange "network unreachable" message or a short message telling them that the server has a problem. In such an emergency, the hard disk and everything on it (including the regular web service) may be unavailable. Rebooting the web server off a diskette makes sense in this setting.

To start the PANIC program as an emergency web server, you need just the GAWK executable and the program below on your diskette. By default, it listens on port 8080. You can supply a different value on the command line by setting the variable MyPort (for example with `-v MyPort=8000').

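BEGIN {
  RS = ORS = "\r\n"                # HTTP lines end in CR LF
  if (MyPort ==  0) MyPort = 8080  # default port unless set with -v MyPort=...
  HttpService = "/inet/tcp/" MyPort "/0/0"   # act as server for any client
  Hello = "<HTML><H1>This site is temporarily out of service.</H1></HTML>"
  while ("GAWK" < "Perl") {        # always true, so we serve forever
    print "HTTP/1.0 200 OK"                                  |& HttpService
    print "Content-Length: " length(Hello) + length(ORS) ORS |& HttpService
    print Hello                                              |& HttpService
    while ((HttpService |& getline) > 0) ;   # read and discard the request
    close(HttpService)                       # get ready for the next client
  }
}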
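
The Content-Length arithmetic of the program above can be checked without any network connection. The following is only a sketch: the function name BuildPanicResponse and its parameter eol (standing in for ORS) are assumptions made for illustration and are not part of PANIC. It assembles the same response as one string and verifies that the announced length equals the length of the body including its trailing line end.

```awk
# Hypothetical helper: build the response that PANIC sends, as one string.
# eol plays the role of ORS ("\r\n") in the original program.
function BuildPanicResponse(body, eol,    len) {
  len = length(body) + length(eol)     # body plus its trailing line end
  return "HTTP/1.0 200 OK" eol \
         "Content-Length: " len eol eol \
         body eol
}

BEGIN {
  Hello = "<HTML><H1>This site is temporarily out of service.</H1></HTML>"
  Response = BuildPanicResponse(Hello, "\r\n")
  n = split(Response, Part, "\r\n\r\n")   # Part[1] is header, Part[2] is body
  print "header and body parts:", n
  print "announced length matches:", (length(Part[2]) == length(Hello) + 2)
}
```

The empty line between header and body comes from the extra ORS that the original program appends to the Content-Length line.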

GETURL - retrieving web pages

GETURL is a versatile building block for shell scripts that need to retrieve files from the net. It takes a web address as a command line parameter and tries to retrieve the contents of this address. The contents are printed on standard output, while the header is printed on `/dev/stderr'. A surrounding shell script could analyze the contents and extract the text or the links. An ASCII browser could be written around GETURL, and web robots are also straightforward to write on top of it. On the Internet, you can find several programs of the same name that do the same job. They are usually much more complex internally and at least 10 times longer.

First, GETURL checks whether it was called with exactly one web address. Then, it checks whether the user chose a particular proxy server by passing its name in the variable Proxy. By default, it is assumed that the local machine serves as proxy. GETURL uses the GET method by default to access the web page. By passing the name of a different method (like HEAD) in the variable Method, it is possible to choose a different behaviour. With the HEAD method, the user receives only the header, not the body of the page.

BEGIN {
  if(ARGC != 2) {
    print "GETURL - retrieve web page via HTTP 1.0"
    print "IN:\n    the URL as a command line parameter"
    print "PARA:\n    -v Proxy=MyProxy"
    print "OUT:\n    the page content on stdout"
    print "    the page header  on stderr"
    print "JK 16.05.97"
    exit
  }
  URL = ARGV[1]; ARGV[1] = ""
  if (Proxy     == "")  Proxy     = "127.0.0.1"
  if (ProxyPort ==  0)  ProxyPort = 80
  if (Method    == "")  Method    = "GET"
  HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
  RS = "\r\n\r\n"
  printf "%s %s HTTP/1.0%s", Method, URL, RS |& HttpService  # URL may contain %
  HttpService                                |& getline Header
  printf "%s%s", Header, RS > "/dev/stderr"   # never use data as a format
  while ((HttpService |& getline) > 0) printf "%s", $0
}

You may change this program as needed, but be careful with the last lines. Make sure transmission of binary data is not corrupted by additional line breaks. Even as it is now, the byte sequence "\r\n\r\n" would disappear if it was contained in binary data. You might get caught in a trap when trying a quick fix on this one.
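
The effect described above can be reproduced offline with split(), which discards separators just as record splitting does. This is a minimal sketch under stated assumptions: the helper SplitResponse and the sample response are made up for illustration. It shows that the body is only recovered byte for byte if the separator is put back between the pieces that follow the header.

```awk
# Hypothetical helper: split a raw HTTP response into header and body.
# The body is rebuilt by re-inserting the separator between the pieces,
# so a "\r\n\r\n" inside the body survives.
function SplitResponse(raw, result,    sep, n, i, piece) {
  sep = "\r\n\r\n"
  n = split(raw, piece, sep)
  result["header"] = piece[1]
  result["body"] = (n > 1) ? piece[2] : ""
  for (i = 3; i <= n; i++)
    result["body"] = result["body"] sep piece[i]
}

BEGIN {
  Raw = "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nfirst\r\n\r\nsecond"
  SplitResponse(Raw, R)
  print length(R["body"])     # 15: "first" + separator + "second"
}
```

Simply concatenating the pieces, as the last line of GETURL does with its records, would yield only 11 bytes here.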

REMCONF - remote configuration of embedded systems

Today, you often find powerful processors in embedded systems. Dedicated network routers and controllers for all kinds of machinery are examples of embedded systems. Processors like the Intel x86 or the AMD Elan are able to run multitasking operating systems like XINU or Linux in embedded PCs.

These systems are small and usually do not have a keyboard or a display. Therefore it is difficult to set up their configuration. There are several widespread ways to set them up:

  1. DIP switches
  2. Read Only Memories like EPROMs
  3. Serial Lines or some kind of keyboard
  4. Network connections via telnet or SNMP
  5. HTTP connections with HTML GUIs

In this section, we will look at a solution that uses HTTP connections to control variables of an embedded system that are stored in a file. Since embedded systems have tight limits on resources like memory, it is difficult to employ advanced techniques like SNMP and HTTP servers. So GAWK fits in quite nicely with its single executable that needs just a short script to start working. The following program stores the variables in a file, and a concurrent process in the embedded system may read the file. The program uses the site-independent part of the simple web server that we developed in the first chapter. As mentioned there, all we have to do is write two new procedures, SetUpServer() and HandleGET().

function SetUpServer() {
  TopHeader = "<HTML><title>Remote Configuration</title>"
  TopDoc = "\
    <h2>Please choose one of the following actions:</h2>\
    <UL>\
      <LI>About this server\
      <LI>Read Configuration\
      <LI>Check Configuration\
      <LI>Change Configuration\
      <LI>Save Configuration\
    </UL>"
  TopFooter  = "</HTML>"
  if (ConfigFile == "") ConfigFile = "config.asc"
}

The function SetUpServer() initializes the top level HTML texts as usual. It also initializes the name of the file that contains the configuration parameters and their values. In case the user supplied a name from the command line, that name is used. The file is expected to contain one parameter per line, with the name of the parameter in column one and the value in column two.
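
The expected file format can be illustrated with a small sketch. The helper StoreConfigLine and the parameter names in the sample data are hypothetical; the real program reads the lines with getline and relies on the default field splitting instead.

```awk
# Hypothetical helper: store one "name value" line in the array cfg.
# Returns 1 if the line held a parameter, 0 if it was empty.
function StoreConfigLine(line, cfg,    field, n) {
  n = split(line, field, " ")     # " " splits on runs of blanks, like default FS
  if (n < 1) return 0
  cfg[field[1]] = field[2]        # column one is the name, column two the value
  return 1
}

BEGIN {
  # Sample content; the parameter names are made up for illustration.
  StoreConfigLine("Hostname   router1", Config)
  StoreConfigLine("Baudrate   9600",    Config)
  for (name in Config) print name, Config[name]
}
```

Note that only the second column is kept as the value, so a value containing blanks would be truncated in this format.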

The function HandleGET() reflects the structure of the menu tree as usual. The first menu choice tells the user what this is all about. The second choice reads the configuration file line by line and stores the parameters and their values. Notice that the record separator for this file is \n, in contrast to the record separator of HTTP. The third menu choice builds up an HTML table to show the content of the configuration file just read. The fourth choice does the real work of changing parameters and the last one just saves the configuration into a file.

function HandleGET() {
  if(MENUE[2] =="AboutServer") {
    Document  = "This is a GUI for remote configuration of an\
      embedded system. It is implemented as one GAWK script."
  } else if (MENUE[2]=="ReadConfig") {
    RS = "\n"
    while ((getline < ConfigFile) > 0) config[$1] = $2;
    close(ConfigFile)
    RS = "\r\n"
    Document = "Configuration has been read."
  } else if (MENUE[2]=="CheckConfig") {
    Document = "<TABLE BORDER CELLPADDING=5>"
    for (i in config)
      Document = Document "<TR><TD>" i "</TD><TD>" config[i] "</TD></TR>"
    Document = Document "</TABLE>"
  } else if (MENUE[2]=="ChangeConfig") {
    if ("Param" in GETARG) {            # any parameter to set ?
      if (GETARG["Param"] in config) {  # is  parameter valid  ?
        config[GETARG["Param"]] = GETARG["Value"]
        Document = GETARG["Param"] " = " GETARG["Value"] "."
      } else {
        Document = "Parameter <b>" GETARG["Param"] "</b> is invalid." 
      }
    } else {
      Document = "<FORM method=GET><h4>Change one parameter</h4>\
        <TABLE BORDER CELLPADDING=5>\
        <TR><TD>Parameter</TD><TD>Value</TD></TR>\
        <TR><TD><input type=text name=Param value=\"\"size=20></TD>\
            <TD><input type=text name=Value value=\"\"size=40></TD>\
        </TR></TABLE><input type=submit value=\"Set\"></FORM>"
    }
  } else if (MENUE[2]=="SaveConfig") {
    printf "" > ConfigFile     # make this file an empty file
    for (i in config) printf("%s %s\n", i, config[i]) >> ConfigFile
    close(ConfigFile)
    Document = "Configuration has been saved."
  }
}

We could also view the configuration file as a database. From this point of view, the above program acts like a primitive database server. Real SQL database systems also make a service available by providing a TCP port that clients can connect to. But the application level protocols they use are usually proprietary and also change from time to time. This is also true for the protocol that MiniSQL uses.

URLCHK - look for changed web pages

Most people who make heavy use of Internet resources have a large bookmark file with pointers to interesting web sites. It is impossible to regularly check by hand whether any of these sites has changed. A program is needed to automatically look at the headers of web pages and tell which ones have changed. URLCHK does the comparison after using GETURL with the HEAD method to retrieve the header.

Like GETURL, this program first checks whether it was called with exactly one command line parameter. URLCHK also takes the same variables Proxy and ProxyPort from the command line as GETURL because these variables are handed over to GETURL for each URL that gets checked. The one and only parameter is the name of a file that contains one line for each URL. The first column holds the URL, and the second and third columns hold the length of the URL's body as found on the last two checks. Now, we follow this plan:

  1. read the URLs from the file and remember their most recent lengths
  2. delete the contents of the file
  3. for each URL, check its new length and write it into the file
  4. if the most recent and the new length differ, tell the user

It may seem a bit peculiar to read the URLs from a file together with their two most recent lengths, but this approach has several advantages. You can call the program again and again with the same file. After running the program, you can obtain the changed URLs again by extracting those lines that differ in their second and third columns.
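
Extracting the changed lines from such a file amounts to comparing two columns. The following sketch uses a hypothetical helper HasChanged and made-up sample lines in the three-column format described above.

```awk
# Hypothetical helper: does a line of the URL file describe a changed page?
# Columns: URL, previous length, most recent length.
function HasChanged(line,    col) {
  split(line, col, " ")
  return ((col[2] + 0) != (col[3] + 0))   # compare as numbers
}

BEGIN {
  # Made-up sample lines in the format URLCHK writes.
  Line[1] = "http://www.example.com/a.html 1200 1200"
  Line[2] = "http://www.example.com/b.html 1200 1350"
  for (i = 1; i <= 2; i++)
    if (HasChanged(Line[i])) print Line[i]
}
```

Applied to a real file, the same test can be written as the single pattern `$2 != $3' on the GAWK command line.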

BEGIN {
  if(ARGC != 2) {
    print "URLCHK - check if URLs have changed"
    print "IN:\n    the file with URLs as a command line parameter"
    print "    file contains URL, old length, new length"
    print "PARA:\n    -v Proxy=MyProxy -v ProxyPort=8080"
    print "OUT:\n    same as file with URLs"
    print "JK 02.03.98"
    exit
  }
  URLfile = ARGV[1]; ARGV[1] = ""
  if (Proxy     != "") Proxy     = "-vProxy="     Proxy     " "
  if (ProxyPort != "") ProxyPort = "-vProxyPort=" ProxyPort " "
  while ((getline < URLfile) > 0) Length[$1] = $3 + 0
  close(URLfile)      # now, URLfile is read in and can be updated
  printf "" > URLfile # make URLfile an empty file
  GetHeader = "gawk " Proxy ProxyPort " -vMethod=\"HEAD\" -f geturl.awk "
  for (i in Length) {
    GetThisHeader = GetHeader "'" i "' 2>&1"  # quote the URL for the shell
    NewLength = 0             # in case the header has no Content-Length
    while ((GetThisHeader | getline) > 0)
      if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
    close(GetThisHeader)
    print i, Length[i], NewLength >> URLfile
    if (Length[i] != NewLength)  # report only changed URLs
      print i, Length[i], NewLength
  }
}

Another thing that may look strange is the way GETURL gets called. Before calling GETURL, we have to look if the proxy variables need to be handed over. If so, we prepare strings that will become part of the command line later. In GetHeader we store these strings together with the longest part of the command line. Later, in the loop over the URLs, GetHeader is appended with the URL and a redirection operator to form the command that reads the URL's header over the net. GETURL always produces the headers over `/dev/stderr'. That is the reason why we need the redirection operator to get the header piped in.

This program is not perfect because it assumes that changing URLs results in changed lengths, which is not necessarily true. A more advanced approach would be to look at some other header line that holds time information. But as always when things get a bit more complicated, this is left as an exercise to the reader.
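
As a starting point for that exercise, here is a sketch of how one header line could be picked out of a header by name. The helper HeaderValue and the sample header are assumptions for illustration. Last-Modified is a standard HTTP header field, but a server is not obliged to send it.

```awk
# Hypothetical helper: return the value of a header line such as
# "Last-Modified", or "" if the header does not contain it.
# Field names are compared without regard to case.
function HeaderValue(header, name,    line, n, i, pos, value) {
  n = split(header, line, "\r\n")
  for (i = 1; i <= n; i++) {
    pos = index(line[i], ":")            # first colon ends the field name
    if (pos > 0 && tolower(substr(line[i], 1, pos - 1)) == tolower(name)) {
      value = substr(line[i], pos + 1)
      sub(/^[ \t]+/, "", value)          # drop blanks after the colon
      return value
    }
  }
  return ""
}

BEGIN {
  # Made-up sample header.
  Header = "HTTP/1.0 200 OK\r\n" \
           "Content-Length: 1350\r\n" \
           "Last-Modified: Mon, 02 Mar 1998 10:00:00 GMT"
  print HeaderValue(Header, "last-modified")
}
```

Storing this string in the URL file instead of a length would require a different column layout, because the date itself contains blanks.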

WEBGRAB - extract links from a page

Sometimes it is necessary to extract links from web pages. Browsers do it, web robots do it and sometimes even a human being wants to extract all links in a web page. Having a tool like GETURL at hand, we can solve this problem with a GAWK program of just two lines.

BEGIN { RS = "http://[#%&+./0-9:;?A-Z_a-z~-]*" }
RT != "" { print "gawk -vProxy=MyProxy -f geturl.awk", RT, "> doc" NR ".html" }

This program reads an HTML file and prints all HTTP links that it finds. It relies on GAWK's ability to use regular expressions as record separators. With RS set to a regular expression that matches links, the second action is executed each time a non-empty link has been found. We can find the matching link itself in RT. Notice that the regular expression for URLs is rather crude. A precise regular expression would be much more complex. But the one above works rather well. It is, however, unable to find internal links of an HTML document. Furthermore, it would be straightforward to include FTP, Telnet, News, Mailto and other kinds of links in the regular expression if this were necessary for other tasks.
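
The same crude pattern can also be applied to a string with match(), which may be easier to experiment with than a record separator. This is only a sketch: the helper ExtractLinks and the sample page are assumptions for illustration.

```awk
# Hypothetical helper: collect all HTTP links found in text into the array
# links and return how many were found. Uses the same crude pattern as above.
function ExtractLinks(text, links,    n) {
  n = 0
  while (match(text, /http:\/\/[#%&+.\/0-9:;?A-Z_a-z~-]*/)) {
    links[++n] = substr(text, RSTART, RLENGTH)
    text = substr(text, RSTART + RLENGTH)   # continue after this link
  }
  return n
}

BEGIN {
  # Made-up sample page.
  Page = "<A HREF=\"http://www.gnu.org/software/gawk/\">GAWK</A> and " \
         "<A HREF=\"http://www.suse.de\">SuSE</A>"
  Count = ExtractLinks(Page, Link)
  for (i = 1; i <= Count; i++) print Link[i]
}
```

Since the character class contains no equals sign, a link with a query string such as `?a=1' is cut off at the equals sign. This is one example of how crude the expression is.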

The action could use the system() call to let another instance of GETURL retrieve the page, but here we use a different approach. We let the program print a shell command that can be piped into sh, the Bourne shell. This way it is possible to first extract the links, wrap shell commands around them and write all the shell commands into a file. After editing the file, execution of the file will retrieve exactly those files that we really need. In case we do not want to edit, we can retrieve all the pages like this:

gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh

After this, you will find the contents of all referenced documents in files named `doc*.html' even if they do not contain HTML code. The most annoying thing is that we always have to pass the proxy to GETURL. If you do not like to see the headers of the web pages appear on the screen, you can redirect them to `/dev/null'. Watching the headers appear can be quite interesting because you can see interesting details like which web server the companies use. Now, you can imagine how the robots work that tell the clever marketing people the market shares of Microsoft and Netscape in the web server market.

Port 80 of any web server is like a small hole in a repellent firewall. After attaching a browser to port 80, we will usually catch a glimpse of the bright side of the server (its home page). With a tool like GETURL at hand, we are able to discover some of the more concealed or even indecent (i.e. lacking conformity to standards of quality) services. It can be exciting to see the fancy CGI scripts that lie there, revealing the inner working of the server, and ready to be called.

Although this may sound funny or simply irrelevant, we are talking about severe security holes. Try to explore your own system this way and make sure that none of the above reveals too much information about your system.

STATIST - graphing a statistical distribution

In all the HTTP server examples above, we have never presented an image to the browser and its user. Presenting images is one task. Generating images that reflect some user input and presenting these dynamically generated images is another. In this section we will use GNUPLOT for generating `.gif' files. Due to licensing problems, the default installation of GNUPLOT has the generation of `.gif' files switched off. If your installed version does not accept set term gif, just download and install the most recent version of GNUPLOT and the GD library by Thomas Boutell. Otherwise you still have the chance to generate some ASCII-art style images with GNUPLOT by using set term dumb (I tried it and it worked).

The program we will develop takes the statistical parameters of two samples and computes the t-test statistics. As a result, we get the probabilities that the means and the variances of both samples are the same. In order to let the user check plausibility, the program presents an image of the distributions. The statistical computation follows Numerical Recipes by Press et al. Since GAWK does not have a built-in function for the computation of the incomplete beta function, we use the ibeta() of GNUPLOT. As a side effect, we will learn how to use GNUPLOT as a sophisticated calculator. The comparison of means is done as in tutest, paragraph 14.2 page 613 and the comparison of variances as in ftest, page 611 in Numerical Recipes.
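
The arithmetic that precedes the calls to ibeta() can be tried out on its own. The sketch below computes the t statistic for unequal variances, its degrees of freedom and the F ratio with the same formulas the server uses. The function name Welch and the result array are assumptions for illustration. The probabilities themselves are not computed here because they need the incomplete beta function.

```awk
# Sketch of the arithmetic only; Welch is a hypothetical helper name.
# m = mean, v = variance, n = count of each sample. Results go into r.
function Welch(m1, v1, n1, m2, v2, n2, r,    a, b) {
  a = v1 / n1                      # squared standard error of mean 1
  b = v2 / n2                      # squared standard error of mean 2
  r["t"]  = (m1 - m2) / sqrt(a + b)
  r["df"] = (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1))
  r["f"]  = (v1 > v2) ? v1 / v2 : v2 / v1   # larger variance on top
}

BEGIN {
  Welch(0, 1, 10, 0, 1, 10, R)     # two identical samples of ten values
  print "t =", R["t"], "df =", R["df"], "F =", R["f"]
}
```

For two identical samples, t is 0 and F is 1, so both tests should report that the samples are compatible.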

As usual, we take the site independent code for servers and append our own functions SetUpServer() and HandleGET().

function SetUpServer() {
  TopHeader = "<HTML><title>Statistics with GAWK</title>"
  TopDoc = "\
    <h2>Please choose one of the following actions:</h2>\
    <UL>\
      <LI>About this server\
      <LI>Enter Parameters\
    </UL>"
  TopFooter  = "</HTML>"
  GnuPlot    = "gnuplot 2>&1"
  m1=m2=0;    v1=v2=1;    n1=n2=10
}

Here, you see the menu structure that the user will see. Later, we will see how the program structure of the HandleGET() function reflects the menu structure. What is missing here is the link for the image we will generate. In an event-driven environment, request, generation and delivery of images are separated.

Notice the way we initialize the GnuPlot pipe. By default, GNUPLOT will output the generated image via standard output and the results of printed calculations via standard error. The redirection will cause standard error to be mixed into standard output, enabling us to read results of calculations with getline. By initializing the statistical parameters with some meaningful defaults, we make sure the user will get an image the first time the program is used.

function HandleGET() {
  if(MENUE[2] =="AboutServer") {
    Document  = "This is a GUI for a statistical computation.\
      It compares means and variances of two distributions.\
      It is implemented as one GAWK script and uses GNUPLOT."
  } else if (MENUE[2]=="EnterParameters") {
    Document = ""
    if ("m1" in GETARG) {     # are there parameters to compare ?
      Document = Document ""
      m1=GETARG["m1"]; v1=GETARG["v1"]; n1=GETARG["n1"]
      m2=GETARG["m2"]; v2=GETARG["v2"]; n2=GETARG["n2"]
      t = (m1-m2)/sqrt(v1/n1+v2/n2)
      df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) + (v2/n2)*(v2/n2)/(n2-1))
      if (v1>v2) { f = v1/v2;      df1=n1-1;       df2=n2-1
      } else     { f = v2/v1;      df1=n2-1;       df2=n1-1
      }
      print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")"                 |& GnuPlot
      print "pF=2.0*ibeta(" df2/2 "," df1/2 "," df2/(df2+df1*f) ")"  |& GnuPlot
      print "print pt, pF"                                           |& GnuPlot
      RS="\n"; GnuPlot |& getline; RS="\r\n"    # $1 is pt, $2 is pF
      print "invsqrt2pi=1.0/sqrt(2.0*pi)"                            |& GnuPlot
      print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)"           |& GnuPlot
      print "set term gif medium size 320,240"                       |& GnuPlot
      print "set yrange[-0.3:]"                                      |& GnuPlot
      print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left"            |& GnuPlot
      print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left"            |& GnuPlot
      print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\
                  mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot
      print "quit"                                                   |& GnuPlot
      GnuPlot |& getline Image
      while ((GnuPlot |& getline) > 0) Image = Image RS $0
      close(GnuPlot)
    }
    Document = Document "\
    <h3>Do these samples have the same Gaussian distribution ?</h3>\
    <FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\
    <TR>\
    <TD>1. Mean    </TD><TD><input type=text name=m1 value=" m1 " size=8></TD>\
    <TD>1. Variance</TD><TD><input type=text name=v1 value=" v1 " size=8></TD>\
    <TD>1. Count   </TD><TD><input type=text name=n1 value=" n1 " size=8></TD>\
    </TR><TR>\
    <TD>2. Mean    </TD><TD><input type=text name=m2 value=" m2 " size=8></TD>\
    <TD>2. Variance</TD><TD><input type=text name=v2 value=" v2 " size=8></TD>\
    <TD>2. Count   </TD><TD><input type=text name=n2 value=" n2 " size=8></TD>\
    </TR>                   <input type=submit value=\"Compute\">\
    </TABLE></FORM><BR>"
  } else if (MENUE[2]=="Image") {
    Reason = "OK" ORS "Content-type: image/gif"
    Header = Footer = ""
    Document = Image
  }
}

As usual, we give a short description of the service in the first menu choice. The third menu choice shows us that generation and presentation of an image are two separate actions. While the latter takes place almost instantly in the third menu choice, the former takes place in the much longer second choice. Image data passes from the generating action to the presenting action via the variable Image, which contains a complete .gif image that would otherwise be stored in a file. The way we pass the Content-type to the browser is a bit unusual. It is appended to the OK of the first header line to make sure the type information becomes part of the header. The other variables that get transmitted across the network are made empty because in this case we do not have an HTML document to transmit, only raw image data to be contained in the body.

Most of the work is done in the second menu choice. It can be broken down into two phases. At first, we check if there are statistical parameters. When you first start the program, there usually are no parameters because you are entering the page coming from the top menu. In that case, we only have to present a form that the user can fill in to change statistical parameters and submit them. Subsequently, the submission of the form will cause the execution of the first phase because now there are parameters to handle.

Now that we have parameters, we know there will be an image available. Therefore we place a link to the image at the top of the page. Then, we prepare some variables that will be passed to GNUPLOT for calculation of the probabilities. Prior to reading the results, we must temporarily change RS because GNUPLOT separates lines with newlines. After instructing GNUPLOT to generate a .gif image with given font, width and height, we initiate the insertion of some text, explaining the resulting probabilities. The final plot command actually generates image data. This raw binary has to be read in carefully without adding, changing or deleting a single byte. Hence the unusual initialization of Image and completion with a while loop.

When using this server, you will soon realize that it is far from being perfect. After the first submission of parameters, your browser will show you an image. But due to the caching strategy of most browsers, there will be no automatic reload of images. So, after pressing the submission button, you also have to press the reload button. This problem is not a GAWK problem, it is one of the shortcomings of the HTML- and HTTP-based WWW. One can solve the problem with a JavaScript program. Furthermore, the statistical part of the server does not take care of invalid input. Among others, using negative variances will cause invalid results.

MAZE - walking through a maze in virtual reality

By now, we know how to present arbitrary Content-types to a browser. In this section, our server will present a 3D world to our browser. The 3D world is described in a scene description language (VRML, Virtual Reality Modeling Language) that allows us to travel through a perspective view of a 2D maze with our browser. If your browser has a VRML plugin, you will be able to explore this new technology. We could do one of those boring Hello world examples here, that are usually presented when introducing novices to VRML. If you have never written any VRML code, have a look at the VRML FAQ. Presenting a static VRML scene is a bit trivial; in order to expose GAWK's new capabilities, we will present a dynamically generated VRML scene. The function SetUpServer() is very simple because it only sets the default HTML page and initializes the random number generator. As usual, the surrounding server lets you browse the maze.

function SetUpServer() {
  TopHeader = "<HTML><title>Walk through a maze</title>"
  TopDoc = "\
    <h2>Please choose one of the following actions:</h2>\
    <UL>\
      <LI>About this server\
      <LI>Watch a simple VRML scene\
    </UL>"
  TopFooter  = "</HTML>"
  srand()
}

The function HandleGET() is a bit longer because it first computes the maze and afterwards generates the VRML code that will be sent across the network. As shown in the STATIST example, we set the type of the content to VRML and then store the VRML representation of the maze as the page content. We assume that the maze is stored in a 2D array. Initially, the maze consists of walls only. Then, we add an entry and an exit to the maze and let the rest of the work be done by the function MakeMaze(). Now, only the wall fields are left in the maze. By iterating over these fields, we generate one line of VRML code for each wall field.

function HandleGET() {
  if(MENUE[2] =="AboutServer") {
    Document  = "If your browser has a VRML 2 plugin,\
      this server shows you a simple VRML scene."
  } else if (MENUE[2]=="VRMLtest") {
    XSIZE=YSIZE=11                 # initially, everything is wall
    for(y=0; y<YSIZE; y++) for(x=0; x<XSIZE; x++) Maze[x, y] = "#"
    delete Maze[0, 1]              # entry is not wall
    delete Maze[XSIZE-1, YSIZE-2]  # exit  is not wall
    MakeMaze(1, 1)
    Document = "\
#VRML V2.0 utf8\n\
Group {\n\
  children [\n\
    PointLight {\n\
      ambientIntensity 0.2\n\
      color 0.7 0.7 0.7\n\
      location 0.0 8.0 10.0\n\
    }\n\
    DEF B1 Background {\n\
      skyColor [0 0 0, 1.0 1.0 1.0 ]\n\
      skyAngle 1.6\n\
      groundColor [1 1 1, 0.8 0.8 0.8, 0.2 0.2 0.2 ]\n\
      groundAngle [ 1.2 1.57 ]\n\
    }\n\
    DEF Wall Shape {\n\
      geometry Box {size 1 1 1}\n\
      appearance Appearance { material Material { diffuseColor 0 0 1 } }\n\
    }\n\
    DEF Entry Viewpoint {\n\
      position 0.5 1.0 5.0\n\
      orientation 0.0 0.0 -1.0 0.52\n\
    }\n"
    for (i in Maze) {
      split(i, t, SUBSEP)
      Document = Document "    Transform { translation "
      Document = Document t[1] " 0 -" t[2] " children USE Wall }\n"
    }
    Document = Document "  ] # end of group for world\n}"
    Reason = "OK" ORS "Content-type: model/vrml"
    Header = Footer = ""
  }
}
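
The loop over the wall fields relies on the way GAWK stores two-dimensional indices: Maze[x, y] is kept under the single string key x SUBSEP y, which split() takes apart again. A minimal sketch, with the hypothetical helper WallLine producing the same kind of line as the loop above:

```awk
# Hypothetical helper: turn one index of the Maze array into the VRML line
# that places a wall box at that position.
function WallLine(key,    t) {
  split(key, t, SUBSEP)            # t[1] is x, t[2] is y
  return "    Transform { translation " t[1] " 0 -" t[2] " children USE Wall }"
}

BEGIN {
  Maze[3, 4] = "#"                 # stored under the key 3 SUBSEP 4
  for (key in Maze) print WallLine(key)
}
```

The y coordinate of the maze is mapped to the negative z axis of the VRML world, so the maze extends away from the viewpoint.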

Finally, we have a look at MakeMaze(), the function that generates the Maze array. When entered, this function assumes that the array has been initialized so that each element represents a wall element and the maze is initially full of wall elements. Only the entrance and the exit of the maze should have been left free. The parameters of the function tell us which element must be marked as not being a wall. After this, we take a look at the four neighbouring elements and remember which of them have not been treated yet. Of these neighbouring elements, we take one at random and walk in that direction. Therefore, the wall element in that direction has to be removed, and then we call the function recursively for that element. The maze will only be completed if we repeat the above procedure for all neighbouring elements (in random order). We do this by recursively calling the function for the present element again. This last iteration could have been done in a loop, but it is much simpler to do it recursively.

Notice that elements with coordinates that are both odd are assumed to be on our way through the maze and the generating process cannot terminate as long as there is such an element not being deleted. All other elements are potentially part of the wall.

function MakeMaze(x, y) {
  delete Maze[x, y]     # here we are, we have no wall here
  p = 0                 # count unvisited fields in all directions
  if (x-2 SUBSEP y   in Maze) d[p++] = "-x"
  if (x   SUBSEP y-2 in Maze) d[p++] = "-y"
  if (x+2 SUBSEP y   in Maze) d[p++] = "+x"
  if (x   SUBSEP y+2 in Maze) d[p++] = "+y"
  if (p>0) {            # if there are unvisited fields, go there
    p = int(p*rand())   # choose one unvisited field at random
    if        (d[p] == "-x") { delete Maze[x - 1, y]; MakeMaze(x - 2, y)
    } else if (d[p] == "-y") { delete Maze[x, y - 1]; MakeMaze(x, y - 2)
    } else if (d[p] == "+x") { delete Maze[x + 1, y]; MakeMaze(x + 2, y)
    } else if (d[p] == "+y") { delete Maze[x, y + 1]; MakeMaze(x, y + 2)
    }                   # we are back from recursion
    MakeMaze(x, y);     # try again while there are unvisited fields
  }
}
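
To watch the algorithm at work without a VRML browser, the maze can be printed as ASCII art. The following sketch repeats the algorithm under the hypothetical name Carve, with p and d declared as local variables. That is a stylistic assumption and not the original code, which gets by with global variables because it recomputes them after each return from the recursion. The helper BuildMaze is likewise made up for illustration.

```awk
# Same algorithm as MakeMaze(), with local variables p and d.
function Carve(x, y,    p, d) {
  delete Maze[x, y]                    # the present field is not a wall
  p = 0
  if ((x - 2, y) in Maze) d[p++] = "-x"
  if ((x, y - 2) in Maze) d[p++] = "-y"
  if ((x + 2, y) in Maze) d[p++] = "+x"
  if ((x, y + 2) in Maze) d[p++] = "+y"
  if (p > 0) {
    p = int(p * rand())                # choose one unvisited field at random
    if        (d[p] == "-x") { delete Maze[x - 1, y]; Carve(x - 2, y)
    } else if (d[p] == "-y") { delete Maze[x, y - 1]; Carve(x, y - 2)
    } else if (d[p] == "+x") { delete Maze[x + 1, y]; Carve(x + 2, y)
    } else                   { delete Maze[x, y + 1]; Carve(x, y + 2)
    }
    Carve(x, y)                        # try again for remaining neighbours
  }
}

# Build a maze of the given size and return the number of remaining walls.
function BuildMaze(xsize, ysize,    x, y, walls) {
  split("", Maze)                      # start from an empty array
  for (y = 0; y < ysize; y++) for (x = 0; x < xsize; x++) Maze[x, y] = "#"
  delete Maze[0, 1]                    # entry
  delete Maze[xsize - 1, ysize - 2]    # exit
  Carve(1, 1)
  walls = 0
  for (y = 0; y < ysize; y++)
    for (x = 0; x < xsize; x++)
      if ((x, y) in Maze) walls++
  return walls
}

BEGIN {
  srand()
  BuildMaze(11, 11)
  for (y = 0; y < 11; y++) {           # print one text row per maze row
    row = ""
    for (x = 0; x < 11; x++) row = row (((x, y) in Maze) ? "#" : " ")
    print row
  }
}
```

Whatever the random choices are, an 11 by 11 maze always ends up with 70 wall fields: the 25 fields with odd coordinates are visited, 24 walls between them are removed, and entry and exit account for two more.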

MOBAGWHO - a simple mobile agent

A mobile agent is a program that may be dispatched from a computer and transported to a remote server for execution. This is called migration and means that a process on another system is started that is independent of its originator. Ideally, it wanders through a network while working for its creator or owner. In places like the UMBC Agent Web, people are quite confident that (mobile) agents are a software engineering paradigm that will enable us to significantly increase the efficiency of our work. Mobile agents could become the mediators between users and the networking world. If you appreciate an unbiased view of this technology, you should have a look at the remarkable paper Mobile Agents: Are they a good idea?

Anyway, this sounds interesting, so let us give it a try. A good instance of this paradigm is Agent Tcl, an extension of the Tcl language. After introducing you to a typical development environment, the aforementioned paper shows a nice little example application that we will try to rebuild in GAWK. The `who' agent takes a list of servers and wanders from one server to the next, checking on each one who is logged in. Having reached the last one, it sends us a message with a list of all users it found on each machine.

But before implementing something that might or might not be a mobile agent, let us clarify the concept and some important terms. The agent paradigm in general is such a young scientific discipline that it has not yet developed a widely accepted terminology. Some authors try to give precise definitions, but their scope is often not wide enough to be generally accepted. Franklin and Graesser ask Is it an Agent or just a Program: A Taxonomy for Autonomous Agents and give even better answers than Caglayan and Harrison in their Agent FAQ.

Before delving into the (rather demanding) details of implementation, let me give just one more quotation as a final motivation. Steven Farley published an excellent paper called Mobile Agent System Architecture, in which he asks: Why use an agent architecture?

If client-server systems are the currently established norm and distributed object systems such as CORBA are defining the future standards, why bother with agents? Agent architectures have certain advantages over these other types. Three of the most important advantages are:

1. An agent performs much processing at the server where local bandwidth is high, thus reducing the amount of network bandwidth consumed and increasing overall performance. In contrast, a CORBA client object with the equivalent functionality of a given agent must make repeated remote method calls to the server object because CORBA objects cannot move across the network at runtime.

2. An agent operates independently of the application from which the agent was invoked. The agent operates asynchronously, meaning that the client application does not need to wait for the results. This is especially important for mobile users who are not always connected to the network.

3. The use of agents allows for the injection of new functionality into a system at run time. An agent system essentially contains its own automatic software distribution mechanism. Since CORBA has no built-in support for mobile code, new functionality generally has to be installed manually.

Of course a non-agent system can exhibit these same features with some work. But the mobile code paradigm supports the transfer of executable code to a remote location for asynchronous execution from the start. An agent architecture should be considered for systems where the above features are primary requirements.

When trying to migrate a process from one system to the next, we of course need a server process on the receiving side. Depending on the kind of server process, several ways of implementation come to mind.

Here, we will abuse a common web server as a migration tool. So, we need a universal CGI script on the receiving side (the web server). It will be activated with a POST request. Put it into a location like `/httpd/cgi-bin/PostAgent.sh'. Make sure that the server system uses a version of GAWK that supports network access.

#!/bin/sh
MobAg=/tmp/MobileAgent.$$
cat > $MobAg                             # direct script to mobile agent file
gawk -f $MobAg $MobAg > /dev/null &      # execute agent concurrently
gawk 'BEGIN{print "\r\nAgent started"}'  # HTTP header, terminator and body
rm $MobAg                                # delete script file of agent

By making its process id $$ part of the file name, the script avoids conflicts between concurrent instances of itself. First, all lines from standard input (the mobile agent's source code) are copied into this unique file. Then, the agent is started as a concurrent process and a short message reporting this fact is sent to the submitting client. Finally, the script file of the mobile agent is removed because it is no longer needed. Although the script is short, one point deserves attention: the agent's file is passed to gawk twice, once as the program text (`-f $MobAg') and once as the input file, so that the running agent can read its own source code.

The originating agent itself is started just like any other command line script and reports the results on standard output. But how can an agent that has migrated to a host far away from its origin report the result back home when there is no longer a connection? By letting the name of the original host migrate with the agent. Having arrived at the end of the journey, the agent establishes a connection and reports the results. This is the reason for determining the name of the host with `uname -n' and storing it in MyOrigin for later use. We may also set it with the `-v' option from the command line. This interactivity is only of importance in the context of starting a mobile agent; therefore this BEGIN pattern and its action will not take part in migration.

BEGIN {
  if (ARGC != 2) {
    print "MOBAG - a simple mobile agent"
    print "CALL:\n    gawk -f mobag.awk mobag.awk"
    print "IN:\n    the name of this script as a command line parameter"
    print "PARA:\n    -vMyOrigin=myhost.com"
    print "OUT:\n    the result on stdout"
    print "JK 29.03.98 01.04.98"
    exit
  }
  if (MyOrigin == "") "uname -n" | getline MyOrigin
}

Since GAWK cannot manipulate and transmit parts of the program directly, we have to read the source code and store it in strings. Therefore, we scan it for the beginning and the ending of functions. Each line in between will be appended to the code string until the end of the function has been reached. A special case is this part of the program itself. It is not a function. We put a similar frame around it to treat it like a function. Notice that this mechanism will work for all the functions of the source code, but it cannot guarantee that the order of the functions will be preserved during migration.

#ReadMySelf
/^function /                            { FUNC = $2 }
/^END/ || /^#ReadMySelf/                { FUNC = $1 }
FUNC != ""                              { MOBFUN[FUNC] = MOBFUN[FUNC] RS $0 }
(FUNC!="") && (/^}/ || /^#EndOfMySelf/) { FUNC = "" }
#EndOfMySelf
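
The self-reading rules can be tried without any network access. The following sketch feeds a small, made-up program to the same kind of rules and lists what ends up in MOBFUN. The function name list_captured and the sample program are assumptions for illustration only, and plain awk is used because no networking is involved.

```shell
#!/bin/sh
# Sketch only: list_captured and the sample program are made up for illustration.
list_captured() {
  awk '
    /^function /        { FUNC = $2 }      # start of a function
    /^END/              { FUNC = $1 }      # start of the END rule
    FUNC != ""          { MOBFUN[FUNC] = MOBFUN[FUNC] RS $0 }
    FUNC != "" && /^}/  { FUNC = "" }      # closing brace in column one
    END {                                  # report what was stored
      for (f in MOBFUN)
        print f ": " gsub(/\n/, "&", MOBFUN[f]) " lines"
    }
  ' <<'EOF' | sort
BEGIN { print "not captured" }
function hello() {
  print "hi"
}
END {
  hello()
}
EOF
}

list_captured
```

The BEGIN line of the sample is skipped, just as the agent's own BEGIN pattern is left behind. Note that the index of a function is the second field of its first line, so hello() is stored under the key "hello()", including the parentheses. The output is piped through sort because the order of a `for (f in MOBFUN)' loop is not defined.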

When we built the web server code in the first chapter, we first developed a site independent core. Likewise, we now build an agent independent core that can be extended with application dependent functions. We have already reached the only application independent function we need for the mobile agent. The function migrate() prepares the aforementioned strings containing the program code and transmits them to a server. A consequence of this modular approach is that the migrate() function takes some parameters that we will not need in this application but in future ones. Its mandatory parameter Destination holds the name (or IP address) of the server that the agent wants as a host for its code. The optional parameter MobCode may contain some GAWK code that will be inserted during migration in front of all other code. The optional parameter Label may contain a string that tells the agent what to do in program execution after arrival at its new home site.

One of the serious obstacles in implementing a framework for mobile agents is that it does not suffice to migrate the code. We also have to migrate the state of execution of the agent. In contrast to Agent Tcl, we do not try to migrate the complete set of variables. Instead, we introduce the following convention: only the elements of the array MOBVAR travel with the agent, and every other variable is local to the current host and is lost during migration.

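Now, you can understand what happens to the Label parameter of the function migrate(). It is copied into MOBVAR["Label"] and travels alongside the other data. Since travelling takes place via HTTP, we have to separate records with "\r\n" in RS and ORS as usual. The code assembly for migration takes place in three steps:

  1. Iterate over MOBFUN and append each stored function verbatim to MobCode.
  2. Append a BEGIN pattern whose action assigns the current value of each element of MOBVAR.
  3. Send the assembled code to the destination as the body of a POST request, preceded by the request line and the Content-length header, and print the reply of the server.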

function migrate (Destination, MobCode, Label) {
  MOBVAR["Label"] = Label
  MOBVAR["Destination"] = Destination
  RS = ORS = "\r\n"
  HttpService = "/inet/tcp/0/" Destination
  for (i in MOBFUN) MobCode = MobCode "\n" MOBFUN[i]
  MobCode = MobCode  "\n\nBEGIN {"
  for (i in MOBVAR) MobCode = MobCode "\n  MOBVAR[\"" i "\"] = \"" MOBVAR[i] "\""
  MobCode = MobCode "\n}\n"
  print "POST /cgi-bin/PostAgent.sh HTTP/1.0"  |& HttpService
  print "Content-length:", length(MobCode) ORS |& HttpService
  printf "%s", MobCode                         |& HttpService
  while ((HttpService |& getline) > 0) print $0
  close(HttpService)
}
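
The way migrate() turns the elements of MOBVAR into a BEGIN pattern can also be tried in isolation. The following sketch generates the source text for a single variable and then runs that text as part of a second program to show that the value arrives intact. The function name roundtrip and the value "resume" are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: roundtrip and the sample value are made up for illustration.
roundtrip() {
  # Turn MOBVAR into awk source text, the same way migrate() does.
  code=$(awk 'BEGIN {
    MOBVAR["Label"] = "resume"
    MobCode = "BEGIN {"
    for (i in MOBVAR)
      MobCode = MobCode "\n  MOBVAR[\"" i "\"] = \"" MOBVAR[i] "\""
    print MobCode "\n}"
  }')
  printf '%s\n' "$code"                 # show the generated source text
  # Run the generated text as part of a new program, as the remote host would.
  awk "$code"'
    END { print "restored: " MOBVAR["Label"] }' < /dev/null
}

roundtrip
```

Because the value is pasted between double quotes without any escaping, this simple scheme presumably breaks as soon as a value contains a double quote, a backslash or a newline. Values that may contain such characters have to be encoded before they are stored in MOBVAR.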

The application independent framework is now almost complete. What follows is the END pattern that is executed when the mobile agent has finished reading its own code. First, it checks whether it is already running on a remote host or not. In case initialization has not yet taken place, it starts MyInit(). Otherwise (later, on a remote host) it starts MyJob().

END {
  if (ARGC != 2) exit    # stop when called with wrong parameters
  if (MyOrigin != "")    # is this the originating host ?
    MyInit()             # then we initialize the application
  else                   # we are on a host with migrated data
    MyJob()              # so we do our job
}

All we have to do to extend the framework into a complete application is to write two application specific functions MyInit() and MyJob(). Keep in mind that the former is executed once on the originating host, while the latter is executed after each migration.

function MyInit() {
  MOBVAR["MyOrigin"] = MyOrigin
  MOBVAR["Machines"] = "localhost/80 max/80 moritz/80 castor/80"
  split(MOBVAR["Machines"], Machines)         # which host is the first ?
  migrate(Machines[1], "", "")                # go to the first host
  while (("/inet/tcp/8080/0/0" |& getline)>0) # wait for result
    print $0                                  # print result
}

As mentioned earlier, this agent takes the name of its origin (MyOrigin) with it. Then, it takes the name of its first destination and goes there for further work. Notice that this name has the port number of the web server appended to the name of the server because the function migrate() needs it this way. Finally, it waits for the result to arrive.
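
To see why the host names carry the port number, it helps to walk through a host list by hand. The following sketch prints the special file name that migrate() would build for each entry and then drops that entry from the list, without opening any connection. The function name walk_itinerary is an assumption for illustration only.

```shell
#!/bin/sh
# Sketch only: walk_itinerary is a made-up name and no connection is opened.
walk_itinerary() {
  awk -v machines="$1" 'BEGIN {
    while (index(machines, "/") > 0) {    # any more machines to visit ?
      split(machines, Machines)           # which host is next ?
      print "/inet/tcp/0/" Machines[1]    # special file for this host
      sub(Machines[1], "", machines)      # forget this host
    }
  }'
}

walk_itinerary "localhost/80 max/80 moritz/80"
```

Each entry of the form host/port completes the special file name `/inet/tcp/0/host/port', where the 0 leaves the choice of the local port to the system.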

function MyJob() {
  sub(MOBVAR["Destination"], "", MOBVAR["Machines"])   # forget this host
  MOBVAR["Result"]=MOBVAR["Result"] SUBSEP SUBSEP MOBVAR["Destination"] ":"
  while (("who" | getline) > 0)               # who is logged in ?
    MOBVAR["Result"] = MOBVAR["Result"] SUBSEP $0
  if (index(MOBVAR["Machines"], "/") > 0) {   # any more machines to visit ?
    split(MOBVAR["Machines"], Machines)       # which host is next ?
    migrate(Machines[1], "", "")              # go there
  } else {                                    # no more machines
    gsub(SUBSEP, "\n", MOBVAR["Result"])      # send result to origin
    print MOBVAR["Result"] |& ("/inet/tcp/0/" MOBVAR["MyOrigin"] "/8080")
  }
}

After migrating, the first thing to do in MyJob() is to delete the name of the current host from the list of hosts to visit. Now, it is time to start the real work by appending the host's name to the result string and reading line by line who is logged in on this host. A very annoying circumstance is the fact that the elements of MOBVAR cannot hold the newline character (`\n'). If they did, migration of this string would not work because the string would not obey the syntax rule for a string in GAWK. As a replacement, we use SUBSEP temporarily. If the list of hosts to visit holds at least one more entry, the agent migrates to that place to go on working there. Otherwise, it is time to replace the SUBSEPs with a newline character in the resulting string and report it to the originating host, whose name is stored in MOBVAR["MyOrigin"].
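
The SUBSEP replacement can be tried on its own. The following sketch joins two made-up lines of `who' output with SUBSEP, checks that the string contains no newline while it would be travelling, and restores the newlines at the end. The function name collect_report and the sample lines are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: collect_report and the two input lines are made up for illustration.
collect_report() {
  printf 'alice tty1\nbob tty2\n' | awk '
    { Result = (NR == 1) ? $0 : Result SUBSEP $0 }   # join lines without "\n"
    END {
      if (index(Result, "\n") == 0)                  # safe inside a string constant
        print "no newline while travelling"
      gsub(SUBSEP, "\n", Result)                     # restore the line breaks
      print Result
    }'
}

collect_report
```

SUBSEP is a convenient stand-in because it is a single character that practically never occurs in ordinary text.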

This document was generated on 27 December 1998 using the texi2html translator version 1.51a.