GAWK
The awk scripting language was originally developed as a
pattern-matching language for writing short programs to perform
data manipulation tasks. It was never meant to be used for networking
purposes. awk's strength is the manipulation of textual data
that is stored in files. If we want to exploit its features in a
networking context, we have to use an access mode for network connections
that resembles the access of files as closely as possible. Therefore
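we have the following three special files (listed in descending order
of importance). These files let us use the protocols of the same name
for establishing connections.
- /inet/tcp/localport/hostname/remoteport
- /inet/udp/localport/hostname/remoteport
- /inet/raw/localport/hostname/remoteport (opening raw connections
  requires root privileges)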
In this chapter, we will only demonstrate how to use the TCP protocol. The other protocols are much less important for most users (UDP) or even intractable (RAW).
awk is also meant to be a prototyping language. It is used
to demonstrate feasibility and to play with features and user interfaces.
This can be done with the file-like handling of network connections
described above. In exchange for this simple handling, we give up many
of the advanced features of the TCP/IP family of protocols. Such
features are available when programming in C or Perl. In fact, what we
do in this chapter is very similar to what is described in books like
Internet Programming with Python, Advanced Perl Programming, or
Web Client Programming with Perl.
But we do it without first learning object-oriented ideology, underlying
languages like Tcl/Tk, Perl, Python and all the libraries necessary to
extend these languages before they are ready for the Internet.
Let us observe a network connection at work. Type in the following program and watch the output. Within a second it connects via TCP (`/inet/tcp') to the machine it is running on (`localhost') and asks the service `daytime' on the machine what time it is.
BEGIN {
  "/inet/tcp/0/localhost/daytime" |& getline
  print $0
}
Even experienced awk users will find the second line strange in two
respects:
- A special file is used like a shell command that pipes its output into
  getline. One would rather expect to see the special file
  being read like any other file (getline <
  "/inet/tcp/0/localhost/daytime").
- The operator |& has not been part of any awk
  implementation up to now. It is actually the only extension of the awk
  language needed (apart from the special files) to introduce network access.
Arnold Robbins decided to introduce the |& operator in order to
overcome the crucial restriction that access to files and pipes in
awk is always uni-directional. It was formerly impossible to use
both access modes on the same file or pipe. Instead of changing the whole
concept of file access, he decided to introduce the |& operator
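which behaves exactly like the usual pipe operator except for two additions:
- Normal shell commands connected to their GAWK program with a |&
  pipe can be accessed bi-directionally. The |& turned out to be a quite
  general, useful and natural extension of awk.
- A pipe that consists of a special file name for a network connection is
  not executed as a shell command. Instead, it can be read from and written
  to like a bi-directional network connection.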
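The bi-directional access described above can be tried without any network at all. The following is a minimal sketch, assuming a gawk with |& support and a `sort' command on the machine; the function name sort_via_coprocess is made up for this illustration. It writes lines to the command, closes only the writing side with close(cmd, "to"), and then reads the sorted lines back from the same command.

```awk
# Sort arr[1..n] by sending the values through "sort" as a coprocess.
# Returns the sorted values joined by single blanks.
# (sort_via_coprocess is a hypothetical helper, not part of gawk.)
function sort_via_coprocess(arr, n,    cmd, i, line, out) {
    cmd = "LC_ALL=C sort"
    for (i = 1; i <= n; i++)
        print arr[i] |& cmd            # write to the coprocess
    close(cmd, "to")                   # end of input; sort can now answer
    out = ""
    while ((cmd |& getline line) > 0)  # read from the same coprocess
        out = out (out == "" ? "" : " ") line
    close(cmd)
    return out
}

BEGIN {
    fruit[1] = "pear"; fruit[2] = "apple"; fruit[3] = "fig"
    print sort_via_coprocess(fruit, 3)   # prints: apple fig pear
}
```

Closing only the "to" side matters here because sort does not produce any output before it has seen the end of its input.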
What happens in this program? The operator |& tells getline
to read a line from the special file `/inet/tcp/0/localhost/daytime'.
We could also have printed a line into the special file. But instead we just
read a line with the time, printed it and closed the connection by finishing
the program.
It may well be that for some reason the above program does not run on your
machine. When looking at possible reasons for this you will learn much
about typical problems that arise in network programming. First of all
your implementation of GAWK may not support network access because it is
older than version 3.1, or you do not have a network interface in your machine.
Perhaps your machine uses some other protocol
like DECnet or Novell's IPX. For the rest of this chapter we will assume
you work on a Unix machine that supports TCP/IP. If the above program does
not run on such a machine, it could help to replace the name
`localhost' with the name of your machine or its IP address. If it
does, you could replace `localhost' with the name of another machine
in your vicinity. This way, the program connects to another machine.
Now you should see the date and time being printed by the program.
Otherwise your machine may not support the `daytime' service.
Try changing the service to `chargen' or `ftp'. This way, the program
connects to other services that should give you some response. If you are
curious, you should have a look at your file `/etc/services'. It could
look like this:
# Service name  Service number
echo            7/tcp     # echo sends back each line it receives
echo            7/udp     # echo is good for testing purposes
discard         9/tcp     # discard behaves like `/dev/null'
discard         9/udp     # discard just throws away each line
daytime         13/tcp    # daytime sends date & time once per connection
daytime         13/udp
chargen         19/tcp    # chargen infinitely produces character sets
chargen         19/udp    # chargen is good for testing purposes
ftp             21/tcp    # ftp is the usual file transfer protocol
telnet          23/tcp    # telnet is the usual login facility
smtp            25/tcp    # smtp is the Simple Mail Transfer Protocol
finger          79/tcp    # finger tells you who is logged in
www             80/tcp    # www is the HyperText Transfer Protocol
pop2            109/tcp   # pop2 is an older version of pop3
pop2            109/udp
pop3            110/tcp   # pop3 is the Post Office Protocol
pop3            110/udp   # pop3 is used for receiving email
nntp            119/tcp   # nntp is the USENET News Transfer Protocol
irc             194/tcp   # irc is the Internet Relay Chat
irc             194/udp
The first column of the file holds the name of the service,
the second a unique number and the protocol that one can use to connect to
this service. You see that some services (`echo') support TCP as
well as UDP.
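The column layout just described is easy to take apart with awk itself. Here is a small sketch, assuming lines in the usual `name number/protocol' form with an optional # comment; the function name parse_service and the array keys are chosen for this illustration only.

```awk
# Split one /etc/services line such as "daytime 13/tcp # comment" into
# svc["name"], svc["port"] and svc["proto"].
# Returns 1 on success and 0 for lines that do not describe a service.
# (parse_service is a hypothetical helper.)
function parse_service(line, svc,    f, n, pp) {
    delete svc
    sub(/#.*/, "", line)              # drop a trailing comment
    n = split(line, f, " ")           # split on runs of blanks and tabs
    if (n < 2 || split(f[2], pp, "/") != 2)
        return 0
    svc["name"]  = f[1]
    svc["port"]  = pp[1] + 0
    svc["proto"] = pp[2]
    return 1
}

BEGIN {
    if (parse_service("daytime   13/tcp   # date and time", s))
        print s["name"], s["port"], s["proto"]   # prints: daytime 13 tcp
}
```

To scan the real file, one could feed each line read with getline < "/etc/services" to this function and skip the lines for which it returns 0.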
The next program makes use of the possibility to really interact with a
network service by printing something into the special file. It asks the
so-called finger service if a user of the machine is logged in. When
testing this program, you should also try to change `localhost' to
some other machine name in your local network.
BEGIN {
  NetService = "/inet/tcp/0/localhost/finger"
  print "name" |& NetService
  while ((NetService |& getline) > 0)
    print $0
  close(NetService)
}
After telling the service on the machine which user it is looking for,
the program repeatedly reads lines that come as a reply. When no more
lines are coming (because the service has closed the connection), the
program closes its side of the connection and ends. Try replacing name by your
login name or the name of someone else logged in. If you want a list
of all users currently logged in, replace name by an empty string
(`""'). You could safely delete the final close() command from
the above script because the operating system closes any open connection
by default when a script reaches the end of execution. In order to avoid
portability problems, we always close connections explicitly. With the
Linux kernel,
for example, proper closing results in flushing of buffers, while letting
the close() happen by default will result in discarding buffers.
In the early days of the Internet (until around 1992), you could use
such a program to check if some user in another country was logged in on
a specific machine.
RFC 1288
will give you the exact definition of the finger protocol.
Every contemporary Unix system also has a command named finger,
which functions as a client for the protocol of the same name.
Still today, some people maintain simple information systems
with this ancient protocol. For example, by typing
  finger quake@seismo.unr.edu
you get the latest Earthquake Bulletin for the state of Nevada.
It contains time, location, depth, magnitude and a short comment about
the earthquakes registered in that region during the last 10 days:
  [..]
  DATE-(UTC)-TIME    LAT     LON      DEP   MAG    COMMENTS
  yy/mm/dd hh:mm:ss  deg.    deg.     km
  98/12/14 21:09:22  37.47N  116.30W   0.0  2.3Md   76.4 km S of WARM SPRINGS, NEVA
  98/12/14 22:05:09  39.69N  120.41W  11.9  2.1Md   53.8 km WNW of RENO, NEVADA
  98/12/15 14:14:19  38.04N  118.60W   2.0  2.3Md   51.0 km S of HAWTHORNE, NEVADA
  98/12/17 01:49:02  36.06N  117.58W  13.9  3.0Md   74.9 km SE of LONE PINE, CALIFOR
  98/12/17 05:39:26  39.95N  120.87W   6.2  2.6Md  101.6 km WNW of RENO, NEVADA
  98/12/22 06:07:42  38.68N  119.82W   5.2  2.3Md   50.7 km S of CARSON CITY, NEVAD
In many places today the use of such services is restricted
because most networks have firewalls and proxy servers between themselves
and the Internet. Most firewalls are programmed not to let such a
finger request go beyond the local network.
Another (ab)use of the finger protocol is found in several Coke machines
that are connected to the Internet. There is a short list of such
Coke machines.
You can access them either from the command line or with a simple
GAWK script. They will usually tell you about the different
flavours of Coke and beer available there. If you have an account there,
you can even order a drink this way.
When looking at `/etc/services' you may have noticed that the `daytime' service is also available with `udp'. In the example above change `tcp' to `udp' and `finger' to `daytime'. After starting the example, you will see the expected day and time message and then the program hangs because it waits for more lines coming from the service, but they never come. This behaviour is a consequence of the differences between TCP and UDP. When using UDP, neither party is automatically informed about the other closing the connection. If you continue experimenting this way, you will experience many other subtle differences between TCP and UDP. To avoid such trouble one should always remember the advice Comer & Stevens give in volume III of their series Internetworking With TCP/IP (page 14):
When designing client-server applications, beginners are strongly advised to use TCP because it provides reliable, connection-oriented communication. Programs only use UDP if the application protocol handles reliability, the application requires hardware broadcast or multicast, or the application cannot tolerate virtual circuit overhead.
The preceding programs behaved as clients which connected to a server somewhere
on the net and requested a certain service. Now we will set up such a
service ourselves which mimics the behaviour of the daytime service.
Such a server does not know in advance who is going to connect to it over
the network. Therefore we cannot insert a name for the host to connect to
in our special file name. Start the following program in one window. Notice
that our service does not have the name daytime but the number 8888.
From looking at `/etc/services' you know that names like daytime
are just mnemonics for 16-bit integers. Only root could enter
our new service into `/etc/services' with an appropriate name.
Also notice that the service name has to be entered into a different field
of the special file name because here we set up a server, not a client.
BEGIN {
  print strftime() |& "/inet/tcp/8888/0/0"
}
Now open another window (on the same machine) and start the client program
that was given in the first example of this chapter. But before starting it,
be sure to also change the name daytime to 8888. After starting the
changed client program, you get a reply like this
Sat Sep 27 19:08:16 CEST 1997
and both programs have closed the connections now by terminating themselves.
Now we will intentionally make a mistake to see what happens when the name
8888 (the so-called port) is already used by another service. Start the server
program in both windows. The first one will run, but the second one will
complain that it could not open the connection. Each port on a given
machine can only be used by one server program at a time. Now terminate the
server program and change the name 8888 to echo. After restarting it,
the server program will not run any more and you know why: there is already
an echo service running on your machine. But even if there were no
echo service already running, on a Unix machine you would not get
your own echo server running, because the ports with numbers smaller
than 1024 (echo is at port 7) are reserved for root. On
machines running some flavour of Microsoft Windows, there is no restriction
that reserves ports below 1024 for a privileged user and hence you can start
an echo server there.
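The difference between client and server file names, and the rule about reserved ports, can be summed up in a few helper functions. This is only a sketch; the names tcp_client, tcp_server and is_privileged are invented for this illustration and are not part of GAWK.

```awk
# Special file name for a TCP client: the local port is 0 (any port),
# the remote host and service are given.
function tcp_client(host, service) {
    return "/inet/tcp/0/" host "/" service
}

# Special file name for a TCP server: only the local port is given,
# host and remote port are 0 because the peer is not known in advance.
function tcp_server(port) {
    return "/inet/tcp/" port "/0/0"
}

# On Unix, ports below 1024 are reserved for root.
function is_privileged(port) {
    return (port + 0) > 0 && (port + 0) < 1024
}

BEGIN {
    print tcp_client("localhost", "daytime")     # prints: /inet/tcp/0/localhost/daytime
    print tcp_server(8888)                       # prints: /inet/tcp/8888/0/0
    print is_privileged(7), is_privileged(8888)  # prints: 1 0
}
```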
Turning this short server program into something really useful is simple. Imagine a server that first reads a file name from the client through the network connection, then does something with the file and finally sends the result back to the client. The server side processing could be
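BEGIN {
  NetService = "/inet/tcp/8888/0/0"
  NetService |& getline              # read the file name sent by the client
  CatPipe = "cat " $1                # the name is handed to a shell unchecked
  while ((CatPipe | getline) > 0)
    print $0 |& NetService
  close(CatPipe)
  close(NetService)
}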
and we would
have a remote copying facility. Such a server reads the name of a file
from any client that connects to it and transmits the contents of the
named file across the net. The server side processing could also be
the execution of a command that was transmitted across the network. By this
example you see how simple it is to open up a security hole on your
machine. If you really allowed clients to connect to your machine and
execute arbitrary commands, everyone would be free to do rm -rf *.
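One way to narrow this hole is to check the received name before it ever reaches a shell. The following sketch assumes that the server should only hand out plain file names from its current directory; the function name is_safe_name and the set of allowed characters are assumptions made for this illustration, not a complete security measure.

```awk
# Accept only plain file names: letters, digits, "_", "." and "-",
# not starting with "." or "-". This rejects blanks, slashes and shell
# metacharacters such as ";", "|", "*" or "`".
# (is_safe_name is a hypothetical helper; the whitelist is an assumption.)
function is_safe_name(name) {
    return name ~ /^[A-Za-z0-9_][A-Za-z0-9._-]*$/
}

BEGIN {
    print is_safe_name("notes.txt")       # prints: 1
    print is_safe_name("x; rm -rf *")     # prints: 0
    print is_safe_name("../etc/passwd")   # prints: 0
}
```

A server could call such a check on $1 right after reading the request and close the connection without running any command when the check fails.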
Have you ever wondered what your Netscape or Pine email client does when it retrieves your email from the email server? In this section we will see. The distribution of email is usually done by dedicated email servers that communicate with your machine using special protocols. To receive email, we will use the Post Office Protocol (POP) which is defined in RFC1939. Sending can be done with the much older Simple Mail Transfer Protocol (SMTP) which is defined in RFC821.
When you type in the following program, replace the emailhost by the
name of your local email server. Ask your administrator if the server has a
POP service and then replace its name or number in the program below.
Now the program is ready to connect to your email server but it will not
succeed in retrieving your mail because it does not yet know your login
name and your password. Replace them in the program and the program will
show you the first email the server has in store.
BEGIN {
  POPService = "/inet/tcp/0/emailhost/pop3"
  RS = ORS = "\r\n"
  print "user name"     |& POPService
  POPService            |& getline
  print "pass password" |& POPService
  POPService            |& getline
  print "retr 1"        |& POPService
  POPService            |& getline
  if ($1 != "+OK") exit
  print "quit"          |& POPService
  RS = "\r\n\\.\r\n"    # the dot is escaped so that it matches only a literal "."
  POPService |& getline
  print $0
  close(POPService)
}
The record separators RS and ORS are redefined because the
protocol (POP) requires lines to be separated this way. After identifying
yourself to the email service, the command retr 1 instructs the
service to send the first of all your emails in line. If the service
replies with something other than +OK, the program exits; maybe there
is no email. Otherwise the program first announces that it intends to finish
reading email and finally redefines RS in order to read the entire
email as multiline input in one record. From RFC1939 we know that the body
of the email always ends with a single line containing a single dot.
You can invoke this program as often as you like; it will not delete the
message it read, but leave it on the server.
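Two details of the protocol lend themselves to small helper functions. The sketch below assumes the conventions of RFC1939: a status line starts with +OK or -ERR, and the server puts an extra dot in front of every message line that begins with a dot, which the client has to remove again. The names pop_ok and pop_unstuff are invented for this illustration.

```awk
# True if a POP status line reports success.
# (pop_ok and pop_unstuff are hypothetical helpers.)
function pop_ok(reply) {
    return reply ~ /^\+OK/
}

# Undo the "byte-stuffing" of a retrieved message: every line that
# begins with a dot has that first dot removed. Lines are assumed to be
# separated by "\r\n", and the terminating dot line is assumed to be
# already stripped (as it is when RS matches it).
function pop_unstuff(msg) {
    sub(/^\./, "", msg)              # first line of the message
    gsub(/\r\n\./, "\r\n", msg)      # all following lines
    return msg
}

BEGIN {
    print pop_ok("+OK 2 messages"), pop_ok("-ERR no such message")   # prints: 1 0
}
```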
Could it be that retrieving a web page from a web server is as simple as retrieving an email from an email server? Yes, it is. We only have to use a different but quite similar protocol and a different port. The name of the protocol is HyperText Transfer Protocol (HTTP) and the port number is usually 80. As in the preceding section, ask your administrator about the name of your local web server or proxy web server and its port number for HTTP requests. More detailed information about HTTP can be found at the home of the web protocols including the specification of HTTP in RFC2068. The only book about this protocol that I know can be found at http://www.browsebooks.com/Hethmon/?882.
The following program employs a rather crude approach toward retrieving a
web page because it uses the prehistoric syntax of HTTP 0.9 which almost all
web servers still support. The most noticeable thing about it is that the
program directs the request to the local proxy server whose name you insert
in the special file name (which in turn calls www.yahoo.com).
BEGIN {
  RS = ORS = "\r\n"
  HttpService = "/inet/tcp/0/proxy/80"
  print "GET http://www.yahoo.com" |& HttpService
  while ((HttpService |& getline) > 0)
    print $0
  close(HttpService)
}
Here, again, lines are separated by a redefined RS and ORS.
The GET request that we send to the server is the only kind of
HTTP request that existed when the web was created in the early 90s.
HTTP calls this GET request a "method"; it tells the service to transmit a web page
(here the home page of the Yahoo search engine). Version 1.0 (RFC1945) added
the request methods HEAD and POST. The current version of HTTP
is 1.1 and knows the additional request methods OPTIONS, PUT, DELETE,
and TRACE.
You can fill in any valid web address and the program will print the
HTML code of that page onto your screen. Notice the similarity between
the response of the POP service and the HTTP service. First you get a
header that is terminated by an empty line and then you get the body of
the page in HTML.
The lines of the headers also have the same form as in POP. First there is
the name of a parameter, then a colon and finally the value of that parameter.
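A header line of this form can be split at its first colon. The following sketch stores the result in an array indexed by the lower-cased parameter name (HTTP header names are not case sensitive); the function name parse_header is made up for this illustration.

```awk
# Split a header line "Name: value" at the first colon and store it as
# hdr[tolower(Name)] = value. Returns 1 on success, 0 if the line is
# not a header line. (parse_header is a hypothetical helper.)
function parse_header(line, hdr,    pos, name, value) {
    pos = index(line, ":")
    if (pos < 2)
        return 0
    name  = tolower(substr(line, 1, pos - 1))
    value = substr(line, pos + 1)
    gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim blanks around the value
    hdr[name] = value
    return 1
}

BEGIN {
    parse_header("Content-Type: text/html", h)
    print h["content-type"]              # prints: text/html
}
```

Only the first colon is used as the separator, so values that contain colons themselves (like a host name with a port number) survive intact.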
You can also retrieve GIF files this way, but you will then get binary data that should be redirected into a file. Another application is calling a CGI script on some server. CGI scripts are used when the content of a web page is not constant but generated at the moment you send a request for it. For example, to get a detailed report about the current quotes of the Motorola stock shares, call a CGI script at Yahoo with
print "GET http://quote.yahoo.com/q?s=MOT&d=t" |& HttpService
You could also call for weather reports this way. A good book to go on with is the HTML Source Book. There are also some books on CGI programming, like the one by Thomas Boutell. Another good source is The CGI Resource Index.
Now we know enough about HTTP to set up a primitive web service that just
says "Hello world" when someone connects to it with a Netscape browser.
Compared
to the situation in the preceding section, our program switches roles. It
tries to behave just like the server we have observed. Since we are setting
up a server here, we have to insert the port number in the localport
field of the special file name. The other two fields (hostname and
remoteport) have to contain a 0 because we do not know in
advance which host will connect to our service.
In the early 90s all a server had to do was to send an HTML document and
close the connection. Here we will adhere to the modern syntax of HTTP
and first send a status line telling the Netscape browser that everything
is OK. Then we send a line to tell the browser how many bytes follow in the
body of the message. This was not necessary in the early 90s because both
parties knew that the document ended when the connection closed. Nowadays
it is possible to stay connected after the transmission of one web page.
This is to avoid the network traffic necessary for repeatedly establishing
TCP connections for requesting several images. Hence the receiving party
has to be told how many bytes will be sent. The header is terminated
as usual with an empty line. Finally we send the "Hello world" body
in HTML. The useless while loop swallows the request of the browser.
We could actually omit the loop and on most machines the program would still
work. To check this one out, first start the following program.
BEGIN {
  RS = ORS = "\r\n"
  HttpService = "/inet/tcp/8080/0/0"
  Hello = "<HTML><H1>Hello world</H1></HTML>"
  print "HTTP/1.0 200 OK"                                         |& HttpService
  print "Content-Length: " length(Hello) + length(ORS) ORS        |& HttpService
  print Hello                                                     |& HttpService
  while ((HttpService |& getline) > 0)
    ;
}
Now (on the same machine) start your favourite browser and let it point to
http://localhost:8080. You see, the browser needs to know which port
our server is listening at for requests. If this does not work, the browser
probably tries to connect to a proxy server which does not know your machine.
Then, change the browser's configuration so that the browser will not try to
use a proxy to connect to your machine.
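The structure of such a response (status line, length of the body, empty line, body) can also be built as one string. This is a sketch under the assumption that the body consists of one-byte characters, so that length() counts bytes; the function name http_response is invented for this illustration.

```awk
# Build a complete HTTP/1.0 response as one string with explicit CRLFs.
# Assumption: body contains only one-byte characters, so length(body)
# equals the number of bytes. (http_response is a hypothetical helper.)
function http_response(status, reason, body,    crlf, head) {
    crlf = "\r\n"
    head = "HTTP/1.0 " status " " reason crlf
    head = head "Content-Length: " length(body) crlf
    head = head crlf                    # empty line ends the header
    return head body
}

BEGIN {
    printf "%s\n", http_response(200, "OK", "<HTML><H1>Hello world</H1></HTML>")
}
```

When sending such a string over a connection, printf "%s" is the safer choice because print would append ORS after the body and the announced length would then no longer match.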
Setting up a web service that allows user interaction is more difficult and
leads us to the limits of network access in GAWK. In this section,
we will develop a main program (a BEGIN pattern and its action)
that will become the core of event-driven execution controlled by a GUI.
Each HTTP event that the user triggers by some action within the browser
will be received in this central procedure. Parameters and menu choices are
extracted from this request and an appropriate measure is taken according to
the user's choice.
BEGIN {
  if (MyHost == "") { "uname -n" | getline MyHost; close("uname -n") }
  if (MyPort == 0) MyPort = 8080
  HttpService = "/inet/tcp/" MyPort "/0/0"
  MyPrefix    = "http://" MyHost ":" MyPort
  SetUpServer()
  while ("GAWK" < "Perl") {          # always true: an infinite loop
    RS = ORS = "\r\n"                # header lines are terminated this way
    Status   = 200                   # this means OK
    Reason   = "OK"
    Header   = TopHeader
    Document = TopDoc
    Footer   = TopFooter
    if (GETARG["Method"] == "GET") {
      HandleGET()
    } else if (GETARG["Method"] == "HEAD") {
      # not yet implemented
    } else if (GETARG["Method"] != "") {
      print "bad method", GETARG["Method"]
    }
    Prompt = Header Document Footer
    print "HTTP/1.0", Status, Reason                       |& HttpService
    print "Content-length:", length(Prompt) + length(ORS)  |& HttpService
    print ORS Prompt                                       |& HttpService
    while ((HttpService |& getline) > 0)
      ;                              # ignore all the header lines
    close(HttpService)               # stop talking to this client
    HttpService |& getline           # wait for new client request
    print systime(), strftime(), $0  # do some logging
    delete GETARG; delete MENUE; delete PARAM
    GETARG["Method"] = $1; GETARG["URI"] = $2; GETARG["Version"] = $3
    for (i = length($2); (substr($2, i, 1) != "?") && (i > 0); i--)
      ;                              # search backwards for a "?"
    if (i > 0) {                     # is there a "?" indicating a CGI request ?
      split(substr($2, 1, i-1), MENUE, "[/:]")
      split(substr($2, i+1), PARAM, "&")
      for (i in PARAM) {
        j = index(PARAM[i], "=")
        GETARG[substr(PARAM[i], 1, j-1)] = substr(PARAM[i], j+1)
      }
    } else {                         # there is no "?", no need for splitting PARAMs
      split($2, MENUE, "[/:]")
    }
  }
}
This web server presents menu choices in the form of HTML links.
Therefore, it has to tell the browser the name of the host it is
residing on. When starting the server, the user may supply the name
of the host from the command line with gawk -v MyHost="Rumpelstilzchen".
If the user does not do this, the server looks up the name of the host it is
running on for later use as a web address in HTML documents. The same
applies to the port number. These values will later be inserted into the
HTML content of the web pages to refer to the home system.
Each server that we will build around this core has to initialize some
application dependent variables (like the default home page) in a procedure
SetUpServer(), which will be called immediately before entering the
infinite loop of the server. For now, we will write an instance that
initiates a trivial interaction. With this home page, the client user
can click on two possible choices and will get the current date either
in human readable format or in seconds since 1970.
function SetUpServer() {
  TopHeader = "<HTML><HEAD><title>My name is GAWK, GNU AWK</title></HEAD><BODY>"
  TopDoc = "<h2>\
    Do you prefer your date <A HREF=" MyPrefix "/human>human</A> or\
    <A HREF=" MyPrefix "/POSIX>POSIXed</A>?</h2>" ORS ORS
  TopFooter = "</BODY></HTML>"
}
On the first run through the main loop, the default line terminators are
set and the default home page is copied to the actual home page. Since this
is the first run, GETARG["Method"] is not initialized yet, hence the
case selection over the method does nothing. Now that the home page is
initialized, the server can start communicating to a client browser.
Having supplied the initial home page to the browser with a valid document
stored in the parameter Prompt, it closes the connection and waits
for the next request. When the request comes, a log line is printed that
allows us to see which request the server receives. Then, all variables for
global storage of request parameters are first cleared and then filled with the
extracted new values. Next, the name of the requested resource is split into
parts and stored for later evaluation. If the request contains a ?,
then the request has CGI variables seamlessly appended to the web address.
Everything in front of the ? is split up into menu items and
everything behind the ? is a list of variable=value pairs
(separated by &) that also need splitting. This way, CGI variables are
isolated and stored. This procedure lacks recognition of special characters
that are transmitted in coded form (as defined in RFC2068). Here, any
optional request header and body parts are ignored. We do not need
header parameters and the request body, but when refining our approach or
working with the POST and PUT methods, reading the header and body as well will
become inevitable. Header parameters should then be stored in a global
variable as well as the body.
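The missing recognition of coded special characters could be added with a small function. The sketch below assumes the usual coding of CGI values, where a blank is sent as + and any other special character as % followed by two hexadecimal digits; it relies on the gawk function strtonum(), and the name urldecode is chosen for this illustration.

```awk
# Decode a CGI value: "+" becomes a blank and "%XX" becomes the
# character with hexadecimal code XX. Uses gawk's strtonum().
# (urldecode is a hypothetical helper.)
function urldecode(s,    out, hex) {
    gsub(/\+/, " ", s)                 # must happen before %2B is decoded
    out = ""
    while (match(s, /%[0-9A-Fa-f][0-9A-Fa-f]/)) {
        hex = substr(s, RSTART + 1, 2)
        out = out substr(s, 1, RSTART - 1) sprintf("%c", strtonum("0x" hex))
        s = substr(s, RSTART + 3)      # continue behind the decoded triple
    }
    return out s
}

BEGIN {
    print urldecode("Hello%2C+world%21")   # prints: Hello, world!
}
```

In the main loop, such a function would be applied to each value before it is stored in GETARG.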
On each subsequent run through the main loop, one request from a browser is
received, evaluated and answered according to the user's choice. This can be
done by letting the value of the HTTP method guide the main loop into
execution of the procedure HandleGET() which evaluates the user's
choice. In this case, we have only one hierarchical level of menus, but
menus are nested in the general case. The menu choices at each level are
separated by / just like in file names. Notice how simple it is to
construct menus of arbitrary depth.
function HandleGET() {
  if (MENUE[2] == "human") {
    Footer = strftime() TopFooter
  } else if (MENUE[2] == "POSIX") {
    Footer = systime() TopFooter
  }
}
The main disadvantage of this approach is that our server is slow and can
handle only one request at a time. Its main advantage is that the server
consists of just one GAWK program. No need for installing an
`httpd', no need for static separate HTML files, CGI scripts or
root privileges. This is rapid prototyping.
Start this program on the same host that runs your browser. Then let your
browser point at http://localhost:8080.
It is also possible to include images into the HTML pages. Most browsers
support the not very well known `.xbm' format, which may contain only
monochrome pictures but is an ASCII format. Binary images are possible but
not so easy to handle. Another way of including images is to generate them
with a tool like GNUPLOT by calling the tool with the system() call or
through a pipe.
In the preceding section, we built the core logic for event driven GUIs.
In this section, we will finally extend the core to a real application.
No one would actually write a commercial web server in GAWK but
it is instructive to see that it is feasible in principle.
The application is ELIZA, the famous program by Joseph Weizenbaum that
mimics the behaviour of a professional psychotherapist when talking to you.
Weizenbaum would certainly object to this description, but this is part of
the legend around ELIZA.
Take the site independent core logic and append the following code.
You will recognize that SetUpServer() is similar to the example
above, except for another function SetUpEliza() being called.
You can use this approach to implement other kinds of servers.
The only changes needed to do so are hidden in the functions
SetUpServer() and HandleGET(). Perhaps you also want to
implement other HTTP-methods.
When extending this example to a complete application, the first
thing to do is to implement the function SetUpServer() that
initializes the HTML pages and some variables. These initializations
determine the way your HTML pages will look (colours, titles, menu
items etc.).
function SetUpServer() {
  SetUpEliza()
  TopHeader = "<HTML><HEAD><title>An HTTP-based System with GAWK</title>\
    <META HTTP-EQUIV=\"Content-Type\"\
    CONTENT=\"text/html; charset=iso-8859-1\"></HEAD>\
    <BODY BGCOLOR=\"#ffffff\" TEXT=\"#000000\" LINK=\"#0000ff\"\
    VLINK=\"#0000ff\" ALINK=\"#0000ff\"> "
  TopDoc = "\
    <h2>Please choose one of the following actions:</h2>\
    <UL>\
    <LI><A HREF=" MyPrefix "/AboutServer>About this server</A></LI>\
    <LI><A HREF=" MyPrefix "/AboutELIZA>About Eliza</A></LI>\
    <LI><A HREF=" MyPrefix "/StartELIZA>Start talking to Eliza</A></LI>\
    </UL>"
  TopFooter = "</BODY></HTML>"
}
The function HandleGET() decides which page the user wants to see
next. It is a nested case selection. Each nesting level refers to a menu
level of the GUI. Each case implements a certain action of the menu. On the
deepest level of case selection the handler essentially knows what the
user wants and stores the answer into a variable that holds the HTML
page contents.
function HandleGET() {
  # A real HTTP server would treat some parts of the URI as a file name.
  # We take parts of the URI as menu choices and go on accordingly.
  if (MENUE[2] == "AboutServer") {
    Document = "This is not a CGI script.\
      This is an httpd, an HTML-file and a CGI script all in one GAWK script.\
      It needs no separate www-server, no installation and no root privileges.\
      <br><br>To run it, do this:<br><ul>\
      <li> start this script with \"gawk -f httpserver.awk\",<br>\
      <li> and on the same host let your www browser open location\
      \"http://localhost:8080\"\
      </ul><br>Details of HTTP come from:<br><ul>\
      <li>Hethmon: Illustrated Guide to HTTP<br>\
      <li>RFC 2068<br></ul><br>JK 14.9.97<br>"
  } else if (MENUE[2] == "AboutELIZA") {
    Document = "This is an implementation of the famous ELIZA program\
      by Joseph Weizenbaum. It is written in GAWK and uses\
      an HTML GUI."
  } else if (MENUE[2] == "StartELIZA") {
    gsub(/\+/, " ", GETARG["YouSay"])
    # Here we also have to substitute coded special characters
    Document = "<form method=GET><h3>" ElizaSays(GETARG["YouSay"]) "</h3>\
      <br><input type=text name=YouSay value=\"\" size=60>\
      <br><input type=submit value=\"Tell her about it\"> </form>"
  }
}
Now we are down at the heart of ELIZA. Here you can see how it works. Initially the user does not say anything; then ELIZA resets its money counter and just asks the user to tell open heartedly what comes to mind. The subsequent answers are first converted to upper case and stored for later comparison. ELIZA will present the bill when being confronted with a sentence that contains the phrase "shut up". Otherwise it looks for keywords in the sentence, conjugates the rest of the sentence, remembers the keyword for later use and finally selects an answer from the set of possible answers.
function ElizaSays(YouSay) {
  if (YouSay == "") {
    cost = 0
    answer = "HI, IM ELIZA, TELL ME YOUR PROBLEM"
  } else {
    q = toupper(YouSay)
    gsub("'", "", q)
    if (q == qold) {
      answer = "PLEASE DONT REPEAT YOURSELF !"
    } else {
      if (index(q, "SHUT UP") > 0) {
        answer = "WELL, PLEASE PAY YOUR BILL. ITS EXACTLY ... $"\
                 int(100*rand()+30+cost/100)
      } else {
        qold = q
        w = "-"                          # no keyword recognized yet
        for (i in k) {                   # search for keywords
          if (index(q, i) > 0) {
            w = i
            break
          }
        }
        if (w == "-") {                  # no keyword, take old subject
          w    = wold
          subj = subjold
        } else {                         # find subject
          subj = substr(q, index(q, w) + length(w) + 1)
          wold = w
          subjold = subj                 # remember keyword and subject
        }
        for (i in conj)
          gsub(i, conj[i], q)            # conjugation
        # from all answers to this keyword, select one randomly
        answer = r[indices[int(split(k[w], indices) * rand()) + 1]]
        # insert subject into answer
        gsub("_", subj, answer)
      }
    }
  }
  cost += length(answer)                 # for later payment : 1 cent per character
  return answer
}
In the long but simple function SetUpEliza() you can see tables
for conjugation, keywords and answers. The associative array k[]
contains indices into the array of answers r[]. To choose an
answer, ELIZA just picks an index randomly.
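The random choice can be studied in isolation with a tiny table. The sketch below assumes that each element of k[] holds a blank-separated list of indices into r[], as the call split(k[w], indices) in ElizaSays() suggests; the keyword "WHAT" with its list "40 43" and the function name pick_answer are made up for this illustration.

```awk
# Pick one answer at random for a keyword.
# k[keyword] is assumed to hold a blank-separated list of indices
# into r[]. Returns "" for an unknown keyword.
# (pick_answer is a hypothetical helper.)
function pick_answer(keyword, k, r,    idx, n) {
    n = split(k[keyword], idx)
    if (n == 0)
        return ""
    return r[idx[int(n * rand()) + 1]]   # rand() < 1, so the index is 1..n
}

BEGIN {
    srand()
    r[40] = "WHY DO YOU ASK ?"
    r[43] = "WHAT DO YOU THINK ?"
    k["WHAT"] = "40 43"                  # hypothetical keyword entry
    print pick_answer("WHAT", k, r)      # prints r[40] or r[43]
}
```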
function SetUpEliza() {
srand()
wold = "-"
subjold = " "
# table for conjugation
conj[" ARE " ] = " AM "
conj["WERE " ] = "WAS "
conj[" YOU " ] = " I "
conj["YOUR " ] = "MY "
conj[" IVE " ] =\
conj[" I HAVE " ] = " YOU HAVE "
conj[" YOUVE " ] =\
conj[" YOU HAVE "] = " I HAVE "
conj[" IM " ] =\
conj[" I AM " ] = " YOU ARE "
conj[" YOURE " ] =\
conj[" YOU ARE " ] = " I AM "
# table of all answers
r[1] = "DONT YOU BELIEVE THAT I CAN _"
r[2] = "PERHAPS YOU WOULD LIKE TO BE ABLE TO _ ?"
r[3] = "YOU WANT ME TO BE ABLE TO _ ?"
r[4] = "PERHAPS YOU DONT WANT TO _ "
r[5] = "DO YOU WANT TO BE ABLE TO _ ?"
r[6] = "WHAT MAKES YOU THINK I AM _ ?"
r[7] = "DOES IT PLEASE YOU TO BELIEVE I AM _ ?"
r[8] = "PERHAPS YOU WOULD LIKE TO BE _ ?"
r[9] = "DO YOU SOMETIMES WISH YOU WERE _ ?"
r[10] = "DONT YOU REALLY _ ?"
r[11] = "WHY DONT YOU _ ?"
r[12] = "DO YOU WISH TO BE ABLE TO _ ?"
r[13] = "DOES THAT TROUBLE YOU ?"
r[14] = "TELL ME MORE ABOUT SUCH FEELINGS"
r[15] = "DO YOU OFTEN FEEL _ ?"
r[16] = "DO YOU ENJOY FEELING _ ?"
r[17] = "DO YOU REALLY BELIEVE I DONT _ ?"
r[18] = "PERHAPS IN GOOD TIME I WILL _ "
r[19] = "DO YOU WANT ME TO _ ?"
r[20] = "DO YOU THINK YOU SHOULD BE ABLE TO _ ?"
r[21] = "WHY CANT YOU _ ?"
r[22] = "WHY ARE YOU INTERESTED IN WHETHER OR NOT I AM _ ?"
r[23] = "WOULD YOU PREFER IF I WERE NOT _ ?"
r[24] = "PERHAPS IN YOUR FANTASIES I AM _ "
r[25] = "HOW DO YOU KNOW YOU CANT _ ?"
r[26] = "HAVE YOU TRIED ?"
r[27] = "PERHAPS YOU CAN NOW _ "
r[28] = "DID YOU COME TO ME BECAUSE YOU ARE _ ?"
r[29] = "HOW LONG HAVE YOU BEEN _ ?"
r[30] = "DO YOU BELIEVE ITS NORMAL TO BE _ ?"
r[31] = "DO YOU ENJOY BEING _ ?"
r[32] = "WE WERE DISCUSSING YOU -- NOT ME"
r[33] = "Oh, I _"
r[34] = "YOU'RE NOT REALLY TALKING ABOUT ME, ARE YOU ?"
r[35] = "WHAT WOULD IT MEAN TO YOU, IF YOU GOT _ ?"
r[36] = "WHY DO YOU WANT _ ?"
r[37] = "SUPPOSE YOU SOON GOT _"
r[38] = "WHAT IF YOU NEVER GOT _ ?"
r[39] = "I SOMETIMES ALSO WANT _"
r[40] = "WHY DO YOU ASK ?"
r[41] = "DOES THAT QUESTION INTEREST YOU ?"
r[42] = "WHAT ANSWER WOULD PLEASE YOU THE MOST ?"
r[43] = "WHAT DO YOU THINK ?"
r[44] = "ARE SUCH QUESTIONS IN YOUR MIND OFTEN ?"
r[45] = "WHAT IS IT THAT YOU REALLY WANT TO KNOW ?"
r[46] = "HAVE YOU ASKED ANYONE ELSE ?"
r[47] = "HAVE YOU ASKED SUCH QUESTIONS BEFORE ?"
r[48] = "WHAT ELSE COMES TO MIND WHEN YOU ASK THAT ?"
r[49] = "NAMES DON'T INTEREST ME"
r[50] = "I DONT CARE ABOUT NAMES -- PLEASE GO ON"
r[51] = "IS THAT THE REAL REASON ?"
r[52] = "DONT ANY OTHER REASONS COME TO MIND ?"
r[53] = "DOES THAT REASON EXPLAIN ANYTHING ELSE ?"
r[54] = "WHAT OTHER REASONS MIGHT THERE BE ?"
r[55] = "PLEASE DON'T APOLOGIZE !"
r[56] = "APOLOGIES ARE NOT NECESSARY"
r[57] = "WHAT FEELINGS DO YOU HAVE WHEN YOU APOLOGIZE ?"
r[58] = "DON'T BE SO DEFENSIVE"
r[59] = "WHAT DOES THAT DREAM SUGGEST TO YOU ?"
r[60] = "DO YOU DREAM OFTEN ?"
r[61] = "WHAT PERSONS APPEAR IN YOUR DREAMS ?"
r[62] = "ARE YOU DISTURBED BY YOUR DREAMS ?"
r[63] = "HOW DO YOU DO ... PLEASE STATE YOUR PROBLEM"
r[64] = "YOU DON'T SEEM QUITE CERTAIN"
r[65] = "WHY THE UNCERTAIN TONE ?"
r[66] = "CAN'T YOU BE MORE POSITIVE ?"
r[67] = "YOU AREN'T SURE ?"
r[68] = "DON'T YOU KNOW ?"
r[69] = "WHY NO _ ?"
r[70] = "DON'T SAY NO, IT'S ALWAYS SO NEGATIVE"
r[71] = "WHY NOT ?"
r[72] = "ARE YOU SURE ?"
r[73] = "WHY NO ?"
r[74] = "WHY ARE YOU CONCERNED ABOUT MY _ ?"
r[75] = "WHAT ABOUT YOUR OWN _ ?"
r[76] = "CAN'T YOU THINK ABOUT A SPECIFIC EXAMPLE ?"
r[77] = "WHEN ?"
r[78] = "WHAT ARE YOU THINKING OF ?"
r[79] = "REALLY, ALWAYS ?"
r[80] = "DO YOU REALLY THINK SO ?"
r[81] = "BUT YOU ARE NOT SURE YOU _ "
r[82] = "DO YOU DOUBT YOU _ ?"
r[83] = "IN WHAT WAY ?"
r[84] = "WHAT RESEMBLANCE DO YOU SEE ?"
r[85] = "WHAT DOES THE SIMILARITY SUGGEST TO YOU ?"
r[86] = "WHAT OTHER CONNECTION DO YOU SEE ?"
r[87] = "COULD THERE REALLY BE SOME CONNECTIONS ?"
r[88] = "HOW ?"
r[89] = "YOU SEEM QUITE POSITIVE"
r[90] = "ARE YOU SURE ?"
r[91] = "I SEE"
r[92] = "I UNDERSTAND"
r[93] = "WHY DO YOU BRING UP THE TOPIC OF FRIENDS ?"
r[94] = "DO YOUR FRIENDS WORRY YOU ?"
r[95] = "DO YOUR FRIENDS PICK ON YOU ?"
r[96] = "ARE YOU SURE YOU HAVE ANY FRIENDS ?"
r[97] = "DO YOU IMPOSE ON YOUR FRIENDS ?"
r[98] = "PERHAPS YOUR LOVE FOR FRIENDS WORRIES YOU"
r[99] = "DO COMPUTERS WORRY YOU ?"
r[100] = "ARE YOU TALKING ABOUT ME IN PARTICULAR ?"
r[101] = "ARE YOU FRIGHTENED BY MACHINES ?"
r[102] = "WHY DO YOU MENTION COMPUTERS ?"
r[103] = "WHAT DO YOU THINK MACHINES HAVE TO DO WITH YOUR PROBLEMS ?"
r[104] = "DON'T YOU THINK COMPUTERS CAN HELP PEOPLE ?"
r[105] = "WHAT IS IT ABOUT MACHINES THAT WORRIES YOU ?"
r[106] = "SAY, DO YOU HAVE ANY PSYCHOLOGICAL PROBLEMS ?"
r[107] = "WHAT DOES THAT SUGGEST TO YOU ?"
r[108] = "I SEE"
r[109] = "IM NOT SURE I UNDERSTAND YOU FULLY"
r[110] = "COME COME ELUCIDATE YOUR THOUGHTS"
r[111] = "CAN YOU ELABORATE ON THAT ?"
r[112] = "THAT IS QUITE INTERESTING"
r[113] = "WHY DO YOU HAVE PROBLEMS WITH MONEY ?"
r[114] = "DO YOU THINK MONEY IS EVERYTHING ?"
r[115] = "ARE YOU SURE THAT MONEY IS THE PROBLEM ?"
r[116] = "I THINK WE WANT TO TALK ABOUT YOU, NOT ABOUT ME"
r[117] = "WHAT'S ABOUT ME ?"
r[118] = "WHY DO YOU ALWAYS BRING UP MY NAME ?"
# table for looking up answers that fit to a certain keyword
k["CAN YOU"] = "1 2 3"
k["CAN I"] = "4 5"
k["YOU ARE"] =\
k["YOURE"] = "6 7 8 9"
k["I DONT"] = "10 11 12 13"
k["I FEEL"] = "14 15 16"
k["WHY DONT YOU"] = "17 18 19"
k["WHY CANT I"] = "20 21"
k["ARE YOU"] = "22 23 24"
k["I CANT"] = "25 26 27"
k["I AM"] =\
k["IM "] = "28 29 30 31"
k["YOU "] = "32 33 34"
k["I WANT"] = "35 36 37 38 39"
k["WHAT"] =\
k["HOW"] =\
k["WHO"] =\
k["WHERE"] =\
k["WHEN"] =\
k["WHY"] = "40 41 42 43 44 45 46 47 48"
k["NAME"] = "49 50"
k["CAUSE"] = "51 52 53 54"
k["SORRY"] = "55 56 57 58"
k["DREAM"] = "59 60 61 62"
k["HELLO"] =\
k["HI "] = "63"
k["MAYBE"] = "64 65 66 67 68"
k[" NO "] = "69 70 71 72 73"
k["YOUR"] = "74 75"
k["ALWAYS"] = "76 77 78 79"
k["THINK"] = "80 81 82"
k["LIKE"] = "83 84 85 86 87 88 89"
k["YES"] = "90 91 92"
k["FRIEND"] = "93 94 95 96 97 98"
k["COMPUTER"] = "99 100 101 102 103 104 105"
k["-"] = "106 107 108 109 110 111 112"
k["MONEY"] = "113 114 115"
k["ELIZA"] = "116 117 118"
}
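The way an answer is picked from these tables can be sketched separately. The helper below mirrors the selection expression used in ElizaSays(), but takes the random number as a parameter so that the result is predictable; the name PickAnswerIndex() is an assumption made for this illustration.

```awk
# PickAnswerIndex() is a hypothetical helper that mirrors the expression
#   indices[int(split(k[w], indices) * rand()) + 1]
# from ElizaSays().  The random number u is passed in; like the result
# of rand(), it must satisfy 0 <= u < 1.
function PickAnswerIndex(list, u,    n, idx) {
    n = split(list, idx, " ")      # n = number of candidate answers
    return idx[int(n * u) + 1]     # int(n * u) lies in the range 0 .. n-1
}

BEGIN {
    list = "6 7 8 9"               # same value as k["YOU ARE"]
    print PickAnswerIndex(list, 0)       # prints 6
    print PickAnswerIndex(list, 0.5)     # prints 8
    print PickAnswerIndex(list, 0.99)    # prints 9
}
```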
Some interesting remarks and details (including the original source code of ELIZA) can be found on Mark Humphrys' home page. Yahoo also has a page with a collection of ELIZA-like programs. Many of them are written in Java, and some even supply source code.
Now, you have learned enough to build your own application. You could, for example, take part in the Loebner Contest to win $100,000. The Loebner Prize is the first formal instantiation of a Turing Test. Hugh Loebner agreed with The Cambridge Center for Behavioral Studies to underwrite a contest designed to implement the Turing Test. Dr. Loebner pledged a Grand Prize of $100,000 for the first computer whose responses were indistinguishable from a human's. Each year a prize of $2000 and a bronze medal are awarded to the most human computer. The winner of the annual contest is the best entry relative to other entries that year, irrespective of how good it is in an absolute sense. Here is an example of a conversation with the winning program of 1997:
(c)1997 Cambridge Center For Behavioral Studies all rights reserved [CONVERSE]
[Intelligent Research Ltd.] Start at: [1997/04/29 08:59:40]
PROGRAM[10:39:20]Hi, my name is Catherine, what's yours?
JUDGE02[10:39:36]My name is Nathan.
PROGRAM[10:39:37]Well, it's nice to meet you, Nathan.
PROGRAM[10:39:45]Did you see that story on CNN last night about the lesbian
PROGRAM[10:40:03]couple who came out at a White House party on Sunday?
JUDGE02[10:40:45]No. I'm sorry I missed that. How did the Prez take it?
PROGRAM[10:40:47]Ellen Degeneres was one of them - she was kissing her lover
PROGRAM[10:41:08]in front of Bill Clinton. Do you think he was very
PROGRAM[10:41:21]embarrassed?
JUDGE02[10:41:37]You tell me. I didn't see it.
PROGRAM[10:41:38]Are you still a Friend of Bill Clinton?
JUDGE02[10:42:22]Yes. I would like to talk about something else. Do you know anything about sculpture.
PROGRAM[10:42:24]Well, I suppose he's doing his best like most of us, but he
PROGRAM[10:42:35]does seem to attract an UNUSUAL amount of bad publicity,
PROGRAM[10:42:51]wouldnt you say--can it all be bad luck?
JUDGE02[10:43:20]I asked you if you knew anything about sculpture.
PROGRAM[10:43:21]OK, but what happens now--does it all drift on, with
PROGRAM[10:43:36]everyone waiting to see how much his friends have said about
PROGRAM[10:43:47]him under threat of lots and lots of jail-or will something
PROGRAM[10:44:01]dramatic actually happen?
JUDGE02[10:44:19]Do you know what sculpture is?
PROGRAM[10:44:20]Well, youre more a fan of his than I am I guess--though I
PROGRAM[10:44:34]cant SWEAR what Id have done in that locked hotel room, I
PROGRAM[10:44:51]must say.
JUDGE02[10:45:26]Have you ever been to the Salmagundi Club?
PROGRAM[10:45:29]Never. to go back to the original discussion Time to get
PROGRAM[10:45:43]off this one I think and onto something more down to earth!
JUDGE02[10:46:20]What is more down to earth than art?
PROGRAM[10:46:21]I don't know what -- to be honest. -
PROGRAM[10:46:30]I suppose it depends on where you're coming from, but as the
PROGRAM[10:46:46]song goes in My Fair Lady, "Why can't a woman be more like a
PROGRAM[10:47:03]man? "
JUDGE02[10:47:52]I liked that song but I prefer "Why can't children be more like we are"
JUDGE02[10:48:10]from another play the name of which I cannot remember.
PROGRAM[10:48:12]What, if anything, do you enjoy about being a man?
JUDGE02[10:49:01]I would enjoy all aspects of life no matter what my gender.
PROGRAM[10:49:02]Do you think of yourself as being attractive?
This program keeps returning to the same story about Bill
Clinton. You see, even a program with a rather narrow mind can behave so
much like a human being that it can win this prize. It is quite common to
let these programs talk to each other via network connections. But during the
competition itself, the program and its computer have to be present at the
place where the competition is held. We all would love to see a GAWK
program win in such an event. Maybe it is up to you to accomplish this?
Some other ideas for useful networked applications:
GAWK distribution.
It was written by Ronald P. Loui (loui@ai.wustl.edu, Associate Professor of
Computer Science, at Washington University in St. Louis) and summarizes why
he teaches GAWK to students of Artificial Intelligence. Here are
some passages from the text:
The GAWK manual can be consumed in a single lab session and the language can be mastered by the next morning by the average student. GAWK's automatic initialization, implicit coercion, I/O support and lack of pointers forgive many of the mistakes that young programmers are likely to make. Those who have seen C but not mastered it are happy to see that GAWK retains some of the same sensibilities while adding what must be regarded as spoonsful of syntactic sugar. [..]
There are further simple answers. Probably the best is the fact that increasingly, undergraduate AI programming is involving the Web. Oren Etzioni (University of Washington, Seattle) has for a while been arguing that the "softbot" is replacing the mechanical engineers' robot as the most glamorous AI testbed. If the artifact whose behavior needs to be controlled in an intelligent way is the software agent, then a language that is well-suited to controlling the software environment is the appropriate language. That would imply a scripting language. If the robot is KAREL, then the right language is "turn left; turn right." If the robot is Netscape, then the right language is something that can generate "netscape -remote 'openURL(http://cs.wustl.edu/~loui)'" with elan. [..]
AI programming requires high-level thinking. There have always been a few gifted programmers who can write high-level programs in assembly language. Most however need the ambient abstraction to have a higher floor. [..]
Second, inference is merely the expansion of notation. No matter whether the logic that underlies an AI program is fuzzy, probabilistic, deontic, defeasible, or deductive, the logic merely defines how strings can be transformed into other strings. A language that provides the best support for string processing in the end provides the best support for logic, for the exploration of various logics, and for most forms of symbolic processing that AI might choose to call "reasoning" instead of "logic."
The implication is that PROLOG, which saves the AI programmer from having to write a unifier, saves perhaps two dozen lines of GAWK code at the expense of strongly biasing the logic and representational expressiveness of any approach.
Now that GAWK itself can connect to the Internet, it should be obvious
that it is suitable to the problem of writing intelligent web agents.
awk is strong at pattern recognition and string processing.
So, it is well suited to the classic problem of language translation.
A first try could be a program that knows the 100 most frequent English
words and their counterparts in German or French. The service could be
implemented by regularly reading email with the program above, replacing
each word by its translation and sending the translation back via SMTP.
Users would send an English email to their translation service and get
back a translated email in return. As soon as this works, more effort can be
spent on a real translation program.
GAWK service that reads the email. It looks
for keywords in the mail and assembles a reply email accordingly. By also
examining the email header carefully and repeating these keywords throughout the
reply email, it is rather simple to give the customer a feeling that
someone cares. Ideally, such a service would search a database of previous
cases for solutions. If none exists, the database could, for example, consist
of all the newsgroups, mailing lists and FAQs on the Internet.
By now you should also have noticed that debugging such a networked application is more complicated than debugging an application that runs as a single process on a single host. The behaviour of a networked application sometimes looks non-causal because it is not reproducible in a strong sense. Whether your network application works or not sometimes depends on
The most difficult problems for a beginner arise from hidden states of the
underlying network. After closing a TCP connection, you often have to wait
a short while before reopening the connection. Even more difficult is the
establishment of a connection that formerly ended with a "broken pipe".
Those connections have to "time out" for a minute before you can reopen the
connection. You can check this with the command netstat -a which
gives you a list of still "active" connections.
Data transmission between two parties over the network occurs synchronously
like in the rendez-vous concept. The party that comes first in a request
for transmission (read or write) will wait for the other to become ready
for transmission. But before going into the details, we must clarify the
meaning of the terms client and server. The terminology of this chapter will
look bizarre and incomprehensible to those who have never heard network
jargon before, but some terminology is unavoidable in order to make clear
what happens. If TCP were absent from this description, things would be
much simpler. The main reason for the subtle distinctions we have to make
is that the system calls connect(), listen() and accept()
(used with TCP) have to be hidden from the user. UDP and RAW are simpler
during connection buildup but TCP is simpler afterwards.
After establishment of a connection, the application protocol (HTTP, POP3 or FTP) determines who is client, who is server and what each one is supposed to do. We will call this the application level client or server. Here, we assume nothing about the intentions of programs accessing the special files. Furthermore, we assume that a connection has not yet been established. In the context of building up a connection, we use the terms connection level client and connection level server to denote the behaviour of a program during establishment of the connection. In this context we also have to distinguish between connection level client and server behaviour for each of the protocols TCP, UDP and RAW. In general, whoever comes first will be called the connection level server in this chapter. The other end will be called the connection level client. With TCP, the connection level server will block, no matter if reading or writing. UDP and RAW know only blocking reads. A connection level client will only block for reading with all protocols.
Establishing a connection will always put both ends into connected state in the sense of the socket API. This is true for TCP and UDP connections and implies that we always have point-to-point connections that enable exactly one party on one end to talk to exactly one party at the other end. Once established, the connection between both will keep away other parties demanding connection. Only after closing a connection can a new one be built up. This is contrary to the usual behaviour of fully developed web servers, which have to avoid situations in which they are not reachable. We have to pay this price in order to enjoy the benefits of a simple communication paradigm.
The special file name for network access is made up of several fields, all of which are mandatory:
/inet/protocol/localport/hostname/remoteport
The inet field is of course constant when accessing the network.
The localport and remoteport fields do not have a meaning
when used with `/inet/raw' because "port" is a term only known to
TCP and UDP. So, when using `/inet/raw' the port fields always have
to be 0.
Here, we will explain the meaning, the range of values and the defaults for all other fields.
protocol field determines which member of the TCP/IP
family of protocols will be selected to transport the data across the
network. There are three possible values (always written lowercase):
tcp, udp and raw. The exact meaning of each is
explained below. There is no default to this field.
localport determines which port on the local
machine will be used to communicate across the network. It has no meaning
with RAW and must therefore be 0. Application level clients usually use 0 to
indicate that they do not care which local port is used. Instead they specify a
remote port to connect to. It is vital for application level servers to use
a number different from 0 here because their service has to be available at
a specific publicly known port number. It is possible to use a name from
/etc/services here.
No default value is assumed. The value 0 means "any".
hostname determines which remote host has to
be at the other end of the connection. Application level servers must fill
this field with a 0 to indicate that they are open for all other hosts
to connect to; this also enforces connection level server behaviour.
It is not possible for an application level server to restrict its
availability to one remote host by entering a host name here.
Application level clients must enter a name different from 0 here.
The name can be either symbolic
(jpl-devvax.jpl.nasa.gov) or numeric (128.149.1.143).
No default value is assumed. The value 0 means "any" and enforces server
behaviour.
remoteport determines which port on the remote
machine will be used to communicate across the network. It has no meaning
with RAW and must therefore be 0. Application level clients must use a number
different from 0 here to indicate which port on the remote machine
they want to connect to. Application level servers must fill this field with
a 0. Instead they specify a local port for clients to connect to.
It is possible to use a name from /etc/services here.
No default value is assumed. The value 0 means "any" and enforces server
behaviour.
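The layout of the special file name can be checked with a few lines of awk before a program tries to use it. The following is a minimal sketch that only inspects the text of the name and never opens a connection; the function name ParseInetName() and the array keys are assumptions made for this illustration.

```awk
# ParseInetName() is a hypothetical helper, not a GAWK built-in.
# It returns 1 and fills f[] if the name consists of the constant
# "inet" followed by the four other fields, and returns 0 otherwise.
function ParseInetName(name, f,    part, n) {
    n = split(name, part, "/")
    # the leading "/" yields an empty first element, so we expect 6 parts
    if (n != 6 || part[1] != "" || part[2] != "inet")
        return 0
    # only these three protocol names are described in the text
    if (part[3] != "tcp" && part[3] != "udp" && part[3] != "raw")
        return 0
    f["protocol"]   = part[3]
    f["localport"]  = part[4]
    f["hostname"]   = part[5]
    f["remoteport"] = part[6]
    return 1
}

BEGIN {
    if (ParseInetName("/inet/tcp/0/localhost/daytime", f))
        print f["protocol"], f["localport"], f["hostname"], f["remoteport"]
    # prints: tcp 0 localhost daytime
    print ParseInetName("/inet/tcp/8888/0", f)
    # prints: 0   (one field is missing)
}
```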
Experts in network programming will notice that the usual
client-server-asymmetry found at the level of the socket API is not visible
here. This is for the sake of simplicity of the high level concept. If you
really miss this, use another language like C or Perl. For GAWK it is
more important to enable users to write a client program with five lines of
code. What happens when a network connection is first accessed can be seen
in the following pseudo code:
if ((name of remote host given) && (other side accepts connection)) {
  rendez-vous successful; transmit with getline or print
} else {
  if ((other side did not accept) && (localport == 0))
    exit unsuccessful
  if (TCP) {
    set up a server accepting connections
    this means waiting for the client on the other side to connect
  } else {
    ready
  }
}
The exact behaviour of this algorithm depends on the values of the fields of the special file name. When in doubt, use the following table that gives you the combinations of values and their meaning. If you think this table is too complicated, restrict yourself to the three lines printed in bold letters. All examples of the preceding chapter used only the patterns printed in bold letters.
| PROTOCOL      | LOCAL PORT | HOST NAME | REMOTE PORT | RESULTING CONNECTION LEVEL BEHAVIOUR |
| tcp           | 0          | x         | x           | dedicated client, fails if immediately connecting to a server on the other side fails |
| udp           | 0          | x         | x           | dedicated client |
| raw           | 0          | x         | 0           | dedicated client, works only as root |
| tcp, udp      | x          | x         | x           | client, switches to dedicated server if necessary |
| tcp, udp      | x          | 0         | 0           | dedicated server |
| raw           | 0          | 0         | 0           | dedicated server, works only as root |
| tcp, udp, raw | x          | x         | 0           | invalid |
| tcp, udp, raw | 0          | 0         | x           | invalid |
| tcp, udp, raw | x          | 0         | x           | invalid |
| tcp, udp      | 0          | 0         | 0           | invalid |
| tcp, udp      | 0          | x         | 0           | invalid |
| raw           | x          | 0         | 0           | invalid |
| raw           | 0          | x         | x           | invalid |
| raw           | x          | x         | x           | invalid |
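The table can also be read as a small lookup function. The sketch below rests on an assumption about which protocol each row applies to: patterns that work only as root and require both ports to be 0 are taken to belong to raw, all other valid patterns to tcp and udp. The function names are made up for this illustration, and the code only classifies field values; it does not open any connection.

```awk
# Pattern() and ConnectionBehaviour() are hypothetical helpers.
# Any field value other than 0 counts as "x".
function Pattern(v) {
    return ((v "") == "0") ? "0" : "x"
}

function ConnectionBehaviour(protocol, localport, hostname, remoteport,    p) {
    p = Pattern(localport) Pattern(hostname) Pattern(remoteport)
    if (protocol == "raw") {          # assumption: port fields must be 0 for raw
        if (p == "0x0") return "dedicated client, works only as root"
        if (p == "000") return "dedicated server, works only as root"
        return "invalid"
    }
    if (protocol == "tcp" || protocol == "udp") {
        if (p == "0xx") return "dedicated client"
        if (p == "xxx") return "client, switches to dedicated server if necessary"
        if (p == "x00") return "dedicated server"
    }
    return "invalid"
}

BEGIN {
    print ConnectionBehaviour("tcp", 0, "localhost", 8888)   # dedicated client
    print ConnectionBehaviour("tcp", 8888, 0, 0)             # dedicated server
    print ConnectionBehaviour("udp", 0, 0, 8888)             # invalid
}
```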
Once again, you should always use TCP. There are few circumstances that justify the use of UDP or RAW. We can take an earlier example as the sender program:
BEGIN {
print strftime() |& "/inet/tcp/8888/0/0"
}
The receiver is almost identical to the first example of this chapter:
BEGIN {
"/inet/tcp/0/localhost/8888" |& getline
print $0
}
TCP guarantees that the bytes arrive at the receiving end in exactly the order in which they were sent at the sending end. No byte will be lost (except when the connection breaks), no byte doubled, no byte out of order. Some overhead is necessary to accomplish this, but this is the price we pay for a reliable service.
It does matter which side starts first. The sender/server has to be started first and will wait for the receiver to read a line.
Both programs are almost identical to their TCP counterparts. Only the name of the `protocol' has changed. As before, it does matter which side starts first. The receiving side will block and wait for the sender. So, in this case, the receiver/client has to be started first.
BEGIN {
print strftime() |& "/inet/udp/8888/0/0"
}
The receiver is almost identical to the first example of this chapter:
BEGIN {
"/inet/udp/0/localhost/8888" |& getline
print $0
}
UDP cannot guarantee that the datagrams arrive at the receiving end in exactly the order in which they were sent at the sending end. Some datagrams could be lost, some doubled, and some out of order. In return, UDP avoids the overhead that TCP needs for its guarantees. This unreliable behaviour is good enough for tasks like data acquisition, logging and even stateless services like NFS.
This is an IP level protocol. Only root is allowed to access this
special file. It is meant to be the basis for implementing
and experimenting with transport level protocols. In the most general case,
the sender has to supply the encapsulating header bytes in front of the
packet and the receiver has to strip the additional bytes from the message.
RAW-receivers cannot receive packets sent with TCP or UDP because the
operating system will not deliver the packets to a RAW-receiver. The
operating system knows some protocols on top of IP and will decide on
its own which packet to deliver to which process
(see Richard Stevens' home page and
books). Therefore we have to use the UDP-receiver for receiving UDP
datagrams sent with the RAW-sender. This is a dark corner - not only of
GAWK but also of TCP/IP implementations.
Those few interested in playing with protocols will benefit from the
approach implemented in a tool called SPAK.
This tool reflects the hierarchical layering of protocols (encapsulation)
in the way data streams are piped out of one program into the next one.
You can see which protocol is based on which other (lower level) protocol
by looking at the command line ordering of the program calls.
Cleverly thought out, SPAK will serve you much better than GAWK's
`/inet' if you want to learn the meaning of each and every bit in the
protocol headers.
We will use the RAW protocol to emulate the behaviour of UDP. The sender program is the same as above but with some additional bytes that fill the places of the UDP fields.
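BEGIN {
  Message         = "Hello world\n"
  SourcePort      = 0
  DestinationPort = 8888
  MessageLength   = length(Message) + 8   # the UDP header is 8 bytes long
  RawService      = "/inet/raw/0/localhost/0"
  # UDP header: source port, destination port, length and checksum,
  # each a 16-bit value with the high byte first; the checksum stays 0
  printf("%c%c%c%c%c%c%c%c%s",
         int(SourcePort/256),      SourcePort%256,
         int(DestinationPort/256), DestinationPort%256,
         int(MessageLength/256),   MessageLength%256,
         0, 0, Message) |& RawService
  fflush(RawService)
}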
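The arithmetic that splits each 16-bit header field into two bytes can be checked without root privileges and without touching the network. The following sketch returns the eight header bytes as decimal numbers instead of sending them; the function name UdpHeaderBytes() is an assumption made for this illustration.

```awk
# UdpHeaderBytes() is a hypothetical helper.  It computes the same eight
# byte values that the RAW sender writes with %c, but returns them as a
# string of decimal numbers so that they can be inspected.
function UdpHeaderBytes(srcport, dstport, msg,    len, h) {
    len = length(msg) + 8                 # length field counts header and data
    h = int(srcport/256) " " (srcport % 256)        # high byte, low byte
    h = h " " int(dstport/256) " " (dstport % 256)
    h = h " " int(len/256) " " (len % 256)
    h = h " 0 0"                          # checksum left at zero, as in the sender
    return h
}

BEGIN {
    print UdpHeaderBytes(0, 8888, "Hello world\n")
    # prints: 0 0 34 184 0 20 0 0
}
```

The destination port 8888 becomes the byte pair 34 and 184 because 34*256 + 184 = 8888.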
Since we try to emulate the behaviour of UDP, we will check if
the RAW-sender is understood by the UDP-receiver but not if the RAW-receiver
can understand the UDP-sender (see above). In a real network, the
RAW-receiver is hardly
of any use because it gets every IP packet that
comes across the network. There will usually be so many packets that
GAWK is too slow for processing them. Only on a network with little
traffic can you test the IP-level receiver program. Programs for analyzing
IP traffic on modem or ISDN channels should be possible.
Port numbers do not have a meaning when using /inet/raw. Their fields
have to be 0 or empty. Only TCP and UDP know them. Receiving data from
/inet/raw is difficult not only because of processing speed but also
because data will usually be binary and not restricted to ASCII. This
implies that line separation with RS will not work as usual.
In this chapter, we will have a look at some self contained scripts that meet at least one of these criteria:
GAWK.
Here, the emphasis is on concise networking. The applications mentioned near the end of the first chapter do not meet this requirement because they would result in long programs that need careful examination of many special cases and in-depth knowledge of vast areas of well established fields other than networking.
We will often refer to the site independent core of the server that
we built in the first chapter. This means the BEGIN part of the
ELIZA program. When building new and non-trivial servers, we will
always copy this building block and append new instances of the two
functions SetUpServer() and HandleGET().
Does it really make sense to employ this same scheme again and again
with varying content? Yes, because this scheme of event-driven
execution provides GAWK with an interface to the most widely
accepted standard for GUIs: the web browser. Now, GAWK can even rival
Tcl/Tk.
Tcl and GAWK have much in common. Both are simple scripting languages
that allow us to quickly solve problems with short programs. But Tcl has Tk
on top of it and GAWK had nothing comparable up to now. While Tcl
needs a large and ever changing library (Tk, which was bound to the X-Windows
environment until recently), GAWK needs just the networking interface
and some kind of browser on the client's side. Besides better portability,
the most important advantage of this approach (embracing well established
standards like HTTP and HTML) is that we do not need to change the
language. We let others do the work of fighting over protocols and standards.
We use HTML, JavaScript, VRML or whatever comes along to do our work.
You thought the "Hello world" example in the first chapter
was useless? By adding just a few lines, we can turn it into something
useful. The PANIC program below tells everyone who connects that the local
site is not working. When a web server breaks down, it makes a difference
if customers get a strange "network unreachable" message or a short message
telling them that the server has a problem. In such a case of emergency
the hard disk and everything on it (including the regular web service) may
be unavailable. Rebooting the web server off a diskette makes sense in this
setting.
To start the PANIC program as an emergency web server, you need just the
GAWK executable and the program below on your diskette. By default,
it listens on port 8080. You can supply a different value from the
command line.
BEGIN {
  RS = ORS = "\r\n"
  if (MyPort == 0) MyPort = 8080
  HttpService = "/inet/tcp/" MyPort "/0/0"   # host 0 and remote port 0: server
  Hello = "<HTML><H1>This site is temporarily out of service.</H1></HTML>"
  while ("GAWK" < "Perl") {                  # always true: serve forever
    print "HTTP/1.0 200 OK" |& HttpService
    # the extra ORS ends the header with an empty line
    print "Content-Length: " length(Hello) + length(ORS) ORS |& HttpService
    print Hello |& HttpService
    while ((HttpService |& getline) > 0) ;   # discard whatever the client sends
    close(HttpService)
  }
}
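The bytes that PANIC writes for each connection can be assembled and inspected without opening a socket. The sketch below builds the same response as one string; the function name PanicResponse() is an assumption made for this illustration.

```awk
# PanicResponse() is a hypothetical helper.  It returns, as one string,
# what the three print statements of the PANIC loop send: the status
# line, the Content-Length header, an empty line and the body.
function PanicResponse(body,    eol, resp) {
    eol  = "\r\n"                        # same value as ORS in PANIC
    resp = "HTTP/1.0 200 OK" eol
    # the length counts the body plus the line end that print appends
    resp = resp "Content-Length: " (length(body) + length(eol)) eol
    resp = resp eol                      # empty line ends the header
    resp = resp body eol
    return resp
}

BEGIN {
    resp = PanicResponse("down")
    gsub(/\r\n/, "<CRLF>", resp)         # make the line ends visible
    print resp
    # prints: HTTP/1.0 200 OK<CRLF>Content-Length: 6<CRLF><CRLF>down<CRLF>
}
```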
GETURL is a versatile building block for shell scripts that need to retrieve files from the net. It takes a web address as a command line parameter and tries to retrieve the contents of this address. The contents are printed to standard output, while the header is printed to `/dev/stderr'. A surrounding shell script could analyze the contents and extract the text or the links. An ASCII browser could be written around GETURL, and web robots are also straightforward to write on top of it. On the Internet, you can find several programs of the same name that do the same job. They are usually much more complex internally and at least 10 times longer.
At first, GETURL checks if it was called with exactly one web address. Then, it checks if the user chose to use a special proxy server whose name is handed over in a variable. By default, it is assumed that the local machine serves as proxy. GETURL uses the GET method by default to access the web page. By handing over the name of a different method (like HEAD) it is possible to choose a different behaviour. With the HEAD method, the user will not receive the body of the page content but only the header.
BEGIN {
  if (ARGC != 2) {
    print "GETURL - retrieve web page via HTTP 1.0"
    print "IN:\n the URL as a command line parameter"
    print "PARA:\n -v Proxy=MyProxy"
    print "OUT:\n the page content on stdout"
    print " the page header on stderr"
    print "JK 16.05.97"
    exit
  }
  URL = ARGV[1]; ARGV[1] = ""
  if (Proxy == "") Proxy = "127.0.0.1"
  if (ProxyPort == 0) ProxyPort = 80
  if (Method == "") Method = "GET"
  HttpService = "/inet/tcp/0/" Proxy "/" ProxyPort
  RS = "\r\n\r\n"
  # explicit formats keep a "%" in the URL or in the header literal
  printf "%s %s HTTP/1.0%s", Method, URL, RS |& HttpService
  HttpService |& getline Header
  printf "%s%s", Header, RS > "/dev/stderr"
  while ((HttpService |& getline) > 0) printf "%s", $0
}
You may change this program as needed, but be careful with the last lines.
Make sure transmission of binary data is not corrupted by additional line
breaks. Even as it is now, the byte sequence "\r\n\r\n" would
disappear if it was contained in binary data. You might get caught in a
trap when trying a quick fix on this one.
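To picture where the header ends and the body begins, the split can be written for a complete response that is already held in one string. This is only a sketch under that assumption; it does not show how to read the bytes from the connection, and the function name SplitResponse() is made up for this illustration. It cuts at the first empty line only, so a later "\r\n\r\n" inside the body is preserved.

```awk
# SplitResponse() is a hypothetical helper working on a complete
# response held in one string.  It stores the text before the first
# empty line in part["header"] and everything after it in part["body"].
function SplitResponse(resp, part,    sep, pos) {
    sep = "\r\n\r\n"
    pos = index(resp, sep)               # position of the first empty line
    if (pos == 0) {                      # no empty line: header only
        part["header"] = resp
        part["body"]   = ""
        return 0
    }
    part["header"] = substr(resp, 1, pos - 1)
    part["body"]   = substr(resp, pos + length(sep))
    return 1
}

BEGIN {
    SplitResponse("HTTP/1.0 200 OK\r\nA: b\r\n\r\nbody\r\n\r\nmore", p)
    print length(p["header"]), length(p["body"])
    # prints: 21 12
}
```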
Today, you often find powerful processors in embedded systems. Dedicated network routers and controllers for all kinds of machinery are examples of embedded systems. Processors like the Intel x86 or the AMD Elan are able to run multitasking operating systems like XINU or Linux in embedded PCs.
These systems are small and usually do not have a keyboard or a display. Therefore it is difficult to set up their configuration. There are several widespread ways to set them up:
In this section, we will look at a solution that uses HTTP connections
to control variables of an embedded system that are stored in a file.
Since embedded systems have tight limits on resources like memory,
it is difficult to employ advanced techniques like SNMP and HTTP
servers. So, GAWK fits in quite nicely with its single executable
that needs just a short script to start working.
The following program stores the variables in a file and a concurrent
process in the embedded system may read the file. The program uses the site
independent part of the simple web server that we developed in the first
chapter. As mentioned there, all we have to do is to write two new procedures
SetUpServer() and HandleGET().
function SetUpServer() {
TopHeader = "<HTML><title>Remote Configuration</title>"
TopDoc = "\
<h2>Please choose one of the following actions:</h2>\
<UL>\
<LI>About this server\
<LI>Read Configuration\
<LI>Check Configuration\
<LI>Change Configuration\
<LI>Save Configuration\
</UL>"
TopFooter = "</HTML>"
if (ConfigFile == "") ConfigFile = "config.asc"
}
The function SetUpServer() initializes the top level HTML texts
as usual. It also initializes the name of the file that contains the
configuration parameters and their values. In case the user supplied
a name from the command line, that name is used. The file is expected to
contain one parameter per line, with the name of the parameter in
column one and the value in column two.
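The expected file layout can be sketched as follows. The parameter names in the comment (baudrate, hostname) and the function name load_config() are invented for this illustration; the server itself reads the file inside HandleGET().

```awk
# Sketch: load a two-column configuration file into an array.
# The function name and the sample parameter names are hypothetical.
#
# Possible content of config.asc:
#   baudrate 9600
#   hostname sensor1
function load_config(file, cfg,    save_rs, line, f, n) {
    save_rs = RS; RS = "\n"               # the file uses plain newlines
    while ((getline line < file) > 0)
        if (split(line, f, " ") >= 2) {   # column one: name, column two: value
            cfg[f[1]] = f[2]
            n++
        }
    close(file)
    RS = save_rs
    return n + 0                          # number of parameters found
}
```

Lines with fewer than two columns are skipped in this sketch; anything after the second column is ignored.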
The function HandleGET() reflects the structure of the menu
tree as usual. The first menu choice tells the user what this is all
about. The second choice reads the configuration file line by line
and stores the parameters and their values. Notice that the record
separator for this file is \n, in contrast to the record separator
of HTTP. The third menu choice builds up an HTML table to show
the content of the configuration file just read. The fourth choice
does the real work of changing parameters and the last one just saves
the configuration into a file.
function HandleGET() {
if(MENUE[2] =="AboutServer") {
Document = "This is a GUI for remote configuration of an\
embedded system. It is implemented as one GAWK script."
} else if (MENUE[2]=="ReadConfig") {
RS = "\n"
while ((getline < ConfigFile) > 0) config[$1] = $2;
close(ConfigFile)
RS = "\r\n"
Document = "Configuration has been read."
} else if (MENUE[2]=="CheckConfig") {
Document = "<TABLE BORDER CELLPADDING=5>"
for (i in config)
Document = Document "<TR><TD>" i "</TD><TD>" config[i] "</TD></TR>"
Document = Document "</TABLE>"
} else if (MENUE[2]=="ChangeConfig") {
if ("Param" in GETARG) { # any parameter to set ?
if (GETARG["Param"] in config) { # is parameter valid ?
config[GETARG["Param"]] = GETARG["Value"]
Document = GETARG["Param"] " = " GETARG["Value"] "."
} else {
Document = "Parameter <b>" GETARG["Param"] "</b> is invalid."
}
} else {
Document = "<FORM method=GET><h4>Change one parameter</h4>\
<TABLE BORDER CELLPADDING=5>\
<TR><TD>Parameter</TD><TD>Value</TD></TR>\
<TR><TD><input type=text name=Param value=\"\" size=20></TD>\
<TD><input type=text name=Value value=\"\" size=40></TD>\
</TR></TABLE><input type=submit value=\"Set\"></FORM>"
}
} else if (MENUE[2]=="SaveConfig") {
printf "" > ConfigFile # make this file an empty file
for (i in config) printf("%s %s\n", i, config[i]) >> ConfigFile
close(ConfigFile)
Document = "Configuration has been saved."
}
}
We could also view the configuration file as a database. From this point of view, the above program acts like a primitive database server. Real SQL database systems also make a service available by providing a TCP port that clients can connect to. But the application level protocols they use are usually proprietary and also change from time to time. This is also true for the protocol that MiniSQL uses.
Most people who make heavy use of Internet resources have a large bookmark file with pointers to interesting web sites. It is impossible to regularly check by hand whether any of these sites has changed. A program is needed to automatically look at the headers of web pages and tell which ones have changed. URLCHK does the comparison after using GETURL with the HEAD method to retrieve the header.
Like GETURL, this program checks first if it was called with exactly
one command line parameter. URLCHK also takes the same variables
Proxy and ProxyPort from the command line as GETURL
because these variables are handed over to GETURL for each URL
that gets checked. The one and only parameter is the name of a file that
contains one line for each URL. In the first column we find the URL and
the second and third columns hold the length of the URL's body when checked
the last two times. Now, we follow this plan:
It may seem a bit peculiar to read the URLs from a file together with their two most recent lengths but this approach has several advantages. You can call the program again and again with the same file. After running the program, you can regenerate the changed URLs by extracting those lines that differ in their second and third columns.
BEGIN {
if(ARGC != 2) {
print "URLCHK - check if URLs have changed"
print "IN:\n the file with URLs as a command line parameter"
print " file contains URL, old length, new length"
print "PARA:\n -v Proxy=MyProxy -v ProxyPort=8080"
print "OUT:\n same as file with URLs"
print "JK 02.03.98"
exit
}
URLfile = ARGV[1]; ARGV[1] = ""
if (Proxy != "") Proxy = "-vProxy=" Proxy " "
if (ProxyPort != "") ProxyPort = "-vProxyPort=" ProxyPort " "
while ((getline < URLfile) > 0) Length[$1] = $3 + 0
close(URLfile) # now, URLfile is read in and can be updated
printf "" > URLfile # make URLfile an empty file
GetHeader = "gawk " Proxy ProxyPort " -vMethod=\"HEAD\" -f geturl.awk "
for (i in Length) {
GetThisHeader = GetHeader i " 2>&1"
NewLength = 0 # do not carry over the length of the previous URL
while ((GetThisHeader | getline) > 0)
if (toupper($0) ~ /CONTENT-LENGTH/) NewLength = $2 + 0
close(GetThisHeader)
print i, Length[i], NewLength >> URLfile
if (Length[i] != NewLength) # report only changed URLs
print i, Length[i], NewLength
}
}
Another thing that may look strange is the way GETURL gets called.
Before calling GETURL, we have to check whether the proxy variables need
to be handed over. If so, we prepare strings that will become part
of the command line later. In GetHeader we store these strings
together with the longest part of the command line. Later, in the loop
over the URLs, the URL and a redirection operator are appended to
GetHeader to form the command that reads the URL's header over the net.
GETURL always writes the headers to `/dev/stderr'. That is
the reason why we need the redirection operator to get the header
piped in.
This program is not perfect because it assumes that changing URLs results in changed lengths, which is not necessarily true. A more advanced approach would be to look at some other header line that holds time information. But as always when things get a bit more complicated, this is left as an exercise to the reader.
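One possible starting point for such an extension is sketched below. It assumes that the server sends a Last-Modified header line, which many servers do but not all. The helper header_value() is hypothetical and not part of URLCHK or GETURL; it simply looks up one header line by name in a header that has already been read into a string.

```awk
# Sketch: pull the value of one header line out of a complete HTTP header.
# header_value() is a hypothetical helper, not part of URLCHK or GETURL.
function header_value(header, name,    lines, n, i, pos, value) {
    n = split(header, lines, /\r?\n/)          # one array element per header line
    for (i = 1; i <= n; i++) {
        pos = index(lines[i], ":")             # the name ends at the first colon
        if (pos > 0 && tolower(substr(lines[i], 1, pos - 1)) == tolower(name)) {
            value = substr(lines[i], pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", value) # trim surrounding blanks
            return value
        }
    }
    return ""                                  # header line not present
}
```

A variant of URLCHK could store the returned string instead of the two lengths and compare it as a string; an empty result would then have to be treated as "unknown" rather than "unchanged".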
Sometimes it is necessary to extract links from web pages. Browsers do it, web robots do it and sometimes even a human being wants to extract all links in a web page. Having a tool like GETURL at hand, we can solve this problem with a one-liner in the Bourne shell.
BEGIN { RS = "http://[#%&+./0-9:;?A-Z_a-z~-]*" }
RT != "" { print "gawk -vProxy=MyProxy -f geturl.awk", "'" RT "'", "> doc" NR ".html" }
This program reads an HTML file and prints all HTTP-links that it finds.
It relies on GAWK's ability to use regular expressions as record
separators. With RS set to a regular expression that matches links,
the second action is executed each time a non-empty link has been found.
We can find the matching link itself in RT.
Notice that the regular expression for URLs is rather crude. A precise
regular expression would be much more complex. But the one above works
rather well. It cannot, however, find internal links of an HTML document.
Furthermore, it is straightforward to also include FTP,
Telnet, News, Mailto and other kinds of links in the regular expression
if other tasks require it.
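The interplay of RS and RT can be tried out on a local file. The following sketch wraps the idea in a function; the name collect_links() is hypothetical. The character class is written here without backslash escapes, which GAWK does not need inside a bracket expression as long as the hyphen comes last.

```awk
# Sketch: collect every match of a regular expression from a file by using
# the expression as record separator. collect_links() is a hypothetical name.
function collect_links(file, links,    save_rs, n) {
    save_rs = RS
    RS = "http://[#%&+./0-9:;?A-Z_a-z~-]*"
    while ((getline < file) > 0)
        if (RT != "") links[++n] = RT     # RT is the text that matched RS
    close(file)
    RS = save_rs
    return n + 0                          # number of links found
}
```

The text between two links becomes the record in $0 and is simply ignored. The last record of the file is followed by no link, so RT is empty there.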
The action could use the system() call to let yet another GETURL
retrieve the page, but here we use a different approach. Instead, we let
the one-liner print a shell command that can be piped into a sh
command in the Bourne shell. This way it is possible to first extract
the links, wrap shell commands around them and pipe all the shell commands
into a file. After editing the file, execution of the file will retrieve
exactly those files that we really need. In case we do not want to edit,
we can retrieve all the pages like this:
gawk -f geturl.awk http://www.suse.de | gawk -f webgrab.awk | sh
After this, you will find the contents of all referenced documents in files named `doc*.html' even if they do not contain HTML code. The most annoying thing is that we always have to pass the proxy to GETURL. If you do not like to see the headers of the web pages appear on the screen, you can redirect them to `/dev/null'. Watching the headers appear can be quite interesting because you can see interesting details like which web server the companies use. Now, you can imagine how the robots work that tell the clever marketing people the market shares of Microsoft and Netscape in the web server market.
Port 80 of any web server is like a small hole in a repellent firewall. After attaching a browser to port 80, we will usually catch a glimpse of the bright side of the server (its home page). With a tool like GETURL at hand, we are able to discover some of the more concealed or even indecent (i.e. lacking conformity to standards of quality) services. It can be exciting to see the fancy CGI scripts that lie there, revealing the inner working of the server, and ready to be called.
gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/
Some servers will give you a directory listing of the CGI files. Knowing the names, you can try to call some of them and watch for useful results. Sometimes there are executables in such directories (like Perl interpreters) that you may call remotely. If there are subdirectories with configuration data of the web server, this can also be quite interesting to read.
gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/test-cgi
gawk -f geturl.awk http://any.host.on.the.net/cgi-bin/printenv
Although this may sound funny or simply irrelevant, we are talking about severe security holes. Try to explore your own system this way and make sure that none of the above reveals too much information about your system.
In all the HTTP server examples above, we have never presented an image
to the browser and its user. Presenting images is one task. Generating
images that reflect some user input and presenting these dynamically
generated images is another. In this section we will use
GNUPLOT
for generating `.gif' files. Due to licensing problems, the default
installation of GNUPLOT has the generation of `.gif' files switched
off. If your installed version does not accept set term gif,
just download and install the most recent version of GNUPLOT and the
GD library by Thomas Boutell.
Otherwise you still have the chance to generate some
ASCII-art style images with GNUPLOT by using set term dumb
(I tried it and it worked).
The program we will develop takes the statistical parameters of two samples
and computes the t-test statistics. As a result, we get the probabilities
that the means and the variances of both samples are the same. In order to
let the user check plausibility, the program presents an image of the
distributions. The statistical computation follows
Numerical Recipes
by Press et al. Since GAWK does not have a built-in function
for the computation of the incomplete beta function, we use the ibeta()
function of GNUPLOT. As a side effect, we will learn how to use GNUPLOT as a
sophisticated calculator. The comparison of means is done as in tutest,
paragraph 14.2 page 613 and the comparison of variances as in ftest,
page 611 in Numerical Recipes.
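The part of the computation that GAWK can do by itself is small. The sketch below isolates it in a function; the name two_sample() and the layout of the result array are assumptions made for this illustration, while the formulas are the ones used in HandleGET() of this section. The probabilities themselves still require the incomplete beta function and are therefore not computed here.

```awk
# Sketch: the statistics that are later handed to GNUPLOT's ibeta().
# two_sample() and the layout of the result array are hypothetical.
function two_sample(m1, v1, n1, m2, v2, n2, res,    a, b) {
    a = v1 / n1; b = v2 / n2
    res["t"]  = (m1 - m2) / sqrt(a + b)        # t for unequal variances
    res["df"] = (a + b)^2 / (a^2 / (n1 - 1) + b^2 / (n2 - 1))
    if (v1 > v2) {                             # larger variance goes on top
        res["f"] = v1 / v2; res["df1"] = n1 - 1; res["df2"] = n2 - 1
    } else {
        res["f"] = v2 / v1; res["df1"] = n2 - 1; res["df2"] = n1 - 1
    }
}
```

With two identical samples (mean 0, variance 1, 10 values each) this yields t = 0, about 18 degrees of freedom for the t-test, and f = 1.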
As usual, we take the site independent code for servers and append
our own functions SetUpServer() and HandleGET().
function SetUpServer() {
TopHeader = "<HTML><title>Statistics with GAWK</title>"
TopDoc = "\
<h2>Please choose one of the following actions:</h2>\
<UL>\
<LI>About this server\
<LI>Enter Parameters\
</UL>"
TopFooter = "</HTML>"
GnuPlot = "gnuplot 2>&1"
m1=m2=0; v1=v2=1; n1=n2=10
}
Here, you see the menu structure that the user will see. Later, we
will see how the program structure of the HandleGET() function
reflects the menu structure. What is missing here is the link for the
image we will generate. In an event driven environment, request,
generation and delivery of images are separated.
Notice the way we initialize the GnuPlot pipe. By default,
GNUPLOT will output the generated image via standard output and
the results of printed calculations via standard error.
The redirection will cause standard error to be mixed into standard
output, enabling us to read results of calculations with getline.
By initializing the statistical parameters with some meaningful
defaults, we make sure the user will get an image the first time
the program is used.
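The effect of the redirection can be tried without GNUPLOT. In the following sketch a plain `sh' stands in for it, which is an assumption made only to keep the example runnable everywhere; the variable name Calc is made up as well. The command sent to the coprocess writes to standard error only, yet getline can read it because of the `2>&1'.

```awk
# Sketch: read what a coprocess writes to standard error.
# "sh" stands in for gnuplot here so the example runs without it.
BEGIN {
    Calc = "sh 2>&1"                       # merge stderr into stdout, as with GnuPlot
    print "echo 0.25 0.5 1>&2" |& Calc     # this command writes to stderr only
    close(Calc, "to")                      # no more commands: let sh finish
    Calc |& getline                        # $1 and $2 now hold the two numbers
    close(Calc)
    print "first:", $1, "second:", $2
}
```

Closing the sending side before reading keeps this small example free of deadlocks. A longer session, like the one with GNUPLOT, keeps the pipe open and alternates between sending commands and reading results.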
function HandleGET() {
if(MENUE[2] =="AboutServer") {
Document = "This is a GUI for a statistical computation.\
It compares means and variances of two distributions.\
It is implemented as one GAWK script and uses GNUPLOT."
} else if (MENUE[2]=="EnterParameters") {
Document = ""
if ("m1" in GETARG) { # are there parameters to compare ?
Document = Document "
"
m1=GETARG["m1"]; v1=GETARG["v1"]; n1=GETARG["n1"]
m2=GETARG["m2"]; v2=GETARG["v2"]; n2=GETARG["n2"]
t = (m1-m2)/sqrt(v1/n1+v2/n2)
df = (v1/n1+v2/n2)*(v1/n1+v2/n2)/((v1/n1)*(v1/n1)/(n1-1) + (v2/n2)*(v2/n2)/(n2-1))
if (v1>v2) { f = v1/v2; df1=n1-1; df2=n2-1
} else { f = v2/v1; df1=n2-1; df2=n1-1
}
print "pt=ibeta(" df/2 ",0.5," df/(df+t*t) ")" |& GnuPlot
print "pF=2.0*ibeta(" df2/2 "," df1/2 "," df2/(df2+df1*f) ")" |& GnuPlot
print "print pt, pF" |& GnuPlot
RS="\n"; GnuPlot |& getline; RS="\r\n" # $1 is pt, $2 is pF
print "invsqrt2pi=1.0/sqrt(2.0*pi)" |& GnuPlot
print "nd(x)=invsqrt2pi/sd*exp(-0.5*((x-mu)/sd)**2)" |& GnuPlot
print "set term gif medium size 320,240" |& GnuPlot
print "set yrange[-0.3:]" |& GnuPlot
print "set label 'p(m1=m2) =" $1 "' at 0,-0.1 left" |& GnuPlot
print "set label 'p(v1=v2) =" $2 "' at 0,-0.2 left" |& GnuPlot
print "plot mu=" m1 ",sd=" sqrt(v1) ", nd(x) title 'sample 1',\
mu=" m2 ",sd=" sqrt(v2) ", nd(x) title 'sample 2'" |& GnuPlot
print "quit" |& GnuPlot
GnuPlot |& getline Image
while ((GnuPlot |& getline) > 0) Image = Image RS $0
close(GnuPlot)
}
Document = Document "\
<h3>Do these samples have the same Gaussian distribution ?</h3>\
<FORM METHOD=GET> <TABLE BORDER CELLPADDING=5>\
<TR>\
<TD>1. Mean </TD><TD><input type=text name=m1 value=" m1 " size=8></TD>\
<TD>1. Variance</TD><TD><input type=text name=v1 value=" v1 " size=8></TD>\
<TD>1. Count </TD><TD><input type=text name=n1 value=" n1 " size=8></TD>\
</TR><TR>\
<TD>2. Mean </TD><TD><input type=text name=m2 value=" m2 " size=8></TD>\
<TD>2. Variance</TD><TD><input type=text name=v2 value=" v2 " size=8></TD>\
<TD>2. Count </TD><TD><input type=text name=n2 value=" n2 " size=8></TD>\
</TR> <input type=submit value=\"Compute\">\
</TABLE></FORM><BR>"
} else if (MENUE[2]=="Image") {
Reason = "OK" ORS "Content-type: image/gif"
Header = Footer = ""
Document = Image
}
}
As usual, we give a short description of the service in the first
menu choice. The third menu choice shows us that generation and
presentation of an image are two separate actions. While the latter
takes place quite instantly in the third menu choice, the former
takes place in the much longer second choice. Image data passes from the
generating action to the presenting action via the variable Image
that contains a complete .gif image that would otherwise be stored
in a file. A bit unusual is the way we pass the Content-type to
the browser. It is appended to the OK of the first header line
to make sure the type information will become part of the header.
The other variables that get transmitted across the network are
made empty because in this case we do not have an HTML document to
transmit but nothing but raw image data to be contained in the body.
Most of the work is done in the second menu choice. It can be broken down into two phases. First, we check if there are statistical parameters. When you first start the program, there usually are no parameters because you are entering the page coming from the top menu. In that case, we only have to present the user a form that can be used to change statistical parameters and submit them. Subsequently, the submission of the form will cause the execution of the first phase because now there are parameters to handle.
Now that we have parameters, we know there will be an image available.
Therefore we place a link to the image at the top of the page. Then,
we prepare some variables that will be passed to GNUPLOT for calculation
of the probabilities. Prior to reading the results, we must temporarily
change RS because GNUPLOT separates lines with newlines.
After instructing GNUPLOT to generate a .gif image with given
font, width and height, we initiate the insertion of some text,
explaining the resulting probabilities. The final plot command
actually generates image data. This raw binary has to be read in carefully
without adding, changing or deleting a single byte. Hence the unusual
initialization of Image and completion with a while loop.
When using this server, you will soon realize that it is far from being
perfect.
After the first submission of parameters, your browser will show you an
image. But due to the caching strategy of most browsers, there will be
no automatic reload of images. So, after pressing the submission button,
you also have to press the reload button. This problem is not a GAWK
problem, it is one of the shortcomings of the HTML- and HTTP-based WWW.
One can solve the problem with a JavaScript program. Furthermore, the
statistical part of the server does not take care of invalid input.
Among other things, negative variances will cause invalid results.
By now, we know how to present arbitrary Content-types to a browser.
In this section, our server will present a 3D world to our browser.
The 3D world is described in a scene description language (VRML,
Virtual Reality Modeling Language) that allows us to travel through a
perspective view of a 2D maze with our browser. If your browser has a
VRML plugin, you will be able to explore this new technology. We could do
one of those boring Hello world examples here, that are usually
presented when introducing novices to
VRML. If you have never written
any VRML code, have a look at
the VRML FAQ.
Presenting a static VRML scene is a bit trivial; in order to expose
GAWK's new capabilities, we will present a dynamically generated
VRML scene. The function SetUpServer() is very simple because it
only sets the default HTML page and initializes the random number
generator. As usual, the surrounding server lets you browse the maze.
function SetUpServer() {
TopHeader = "<HTML><title>Walk through a maze</title>"
TopDoc = "\
<h2>Please choose one of the following actions:</h2>\
<UL>\
<LI>About this server\
<LI>Watch a simple VRML scene\
</UL>"
TopFooter = "</HTML>"
srand()
}
The function HandleGET() is a bit longer because it first computes
the maze and afterwards generates the VRML code that will be sent across
the network. As shown in the STATIST example, we set the type of the
content to VRML and then store the VRML representation of the maze as the
page content. We assume that the maze is stored in a 2D array. Initially,
the maze consists of walls only. Then, we add an entry and an exit to the
maze and let the rest of the work be done by the function MakeMaze.
Now, only the wall fields are left in the maze. By iterating over these
fields, we generate one line of VRML code for each wall field.
function HandleGET() {
if(MENUE[2] =="AboutServer") {
Document = "If your browser has a VRML 2 plugin,\
this server shows you a simple VRML scene."
} else if (MENUE[2]=="VRMLtest") {
XSIZE=YSIZE=11 # initially, everything is wall
for(y=0; y<YSIZE; y++) for(x=0; x<XSIZE; x++) Maze[x, y] = "#"
delete Maze[0, 1] # entry is not wall
delete Maze[XSIZE-1, YSIZE-2] # exit is not wall
MakeMaze(1, 1)
Document = "\
#VRML V2.0 utf8\n\
Group {\n\
children [\n\
PointLight {\n\
ambientIntensity 0.2\n\
color 0.7 0.7 0.7\n\
location 0.0 8.0 10.0\n\
}\n\
DEF B1 Background {\n\
skyColor [0 0 0, 1.0 1.0 1.0 ]\n\
skyAngle 1.6\n\
groundColor [1 1 1, 0.8 0.8 0.8, 0.2 0.2 0.2 ]\n\
groundAngle [ 1.2 1.57 ]\n\
}\n\
DEF Wall Shape {\n\
geometry Box {size 1 1 1}\n\
appearance Appearance { material Material { diffuseColor 0 0 1 } }\n\
}\n\
DEF Entry Viewpoint {\n\
position 0.5 1.0 5.0\n\
orientation 0.0 0.0 -1.0 0.52\n\
}\n"
for (i in Maze) {
split(i, t, SUBSEP)
Document = Document " Transform { translation "
Document = Document t[1] " 0 -" t[2] " children USE Wall }\n"
}
Document = Document " ] # end of group for world\n}"
Reason = "OK" ORS "Content-type: model/vrml"
Header = Footer = ""
}
}
Finally, we have a look at MakeMaze(), the function that generated
the Maze array. When entered, this function assumes that the array
has been initialized so that each element represents a wall element and
the maze is initially full of wall elements. Only the entrance and the exit
of the maze should have been left free. The parameters of the function tell
us which element must be marked as not being a wall. After this, we take
a look at the four neighbouring elements and note which of them have not
been visited yet. Of these unvisited elements, we take one at random and
walk in that direction. Therefore, the wall element in that direction has
to be removed and then, we call the function recursively for that element.
The maze is only complete if we repeat the above procedure for
all neighbouring elements (in random order). We do this by recursively
calling the function for the present element again. This
last iteration could have been done in a loop,
but it is much simpler to do it recursively.
Notice that elements with coordinates that are both odd are assumed to be
on our way through the maze and the generating process cannot terminate
as long as there is such an element not being deleted. All other
elements are potentially part of the wall.
function MakeMaze(x, y) {
delete Maze[x, y] # here we are, we have no wall here
p = 0 # count unvisited fields in all directions
if (x-2 SUBSEP y in Maze) d[p++] = "-x"
if (x SUBSEP y-2 in Maze) d[p++] = "-y"
if (x+2 SUBSEP y in Maze) d[p++] = "+x"
if (x SUBSEP y+2 in Maze) d[p++] = "+y"
if (p>0) { # if there are unvisited fields, go there
p = int(p*rand()) # choose one unvisited field at random
if (d[p] == "-x") { delete Maze[x - 1, y]; MakeMaze(x - 2, y)
} else if (d[p] == "-y") { delete Maze[x, y - 1]; MakeMaze(x, y - 2)
} else if (d[p] == "+x") { delete Maze[x + 1, y]; MakeMaze(x + 2, y)
} else if (d[p] == "+y") { delete Maze[x, y + 1]; MakeMaze(x, y + 2)
} # we are back from recursion
MakeMaze(x, y); # try again while there are unvisited fields
}
}
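To check the algorithm without a VRML browser, the maze can be printed as text. The following stand-alone sketch is a variant, not the code of the server: the names carve() and show() are hypothetical, the neighbour list is kept in local variables, and the repetition for the present element is written as a loop instead of a recursive call.

```awk
# Sketch: depth-first carving with local variables, plus an ASCII dump.
# carve() and show() are hypothetical names; Maze, XSIZE and YSIZE follow the text.
function carve(x, y,    d, p, k) {
    delete Maze[x, y]                        # the present element is no wall
    for (;;) {
        p = 0                                # collect unvisited neighbours
        if ((x-2, y) in Maze) d[p++] = "-x"
        if ((x, y-2) in Maze) d[p++] = "-y"
        if ((x+2, y) in Maze) d[p++] = "+x"
        if ((x, y+2) in Maze) d[p++] = "+y"
        if (p == 0) return                   # dead end: back to the caller
        k = d[int(p * rand())]               # pick one of them at random
        if      (k == "-x") { delete Maze[x-1, y]; carve(x-2, y) }
        else if (k == "-y") { delete Maze[x, y-1]; carve(x, y-2) }
        else if (k == "+x") { delete Maze[x+1, y]; carve(x+2, y) }
        else                { delete Maze[x, y+1]; carve(x, y+2) }
    }
}

function show(    x, y, row) {
    for (y = 0; y < YSIZE; y++) {
        row = ""
        for (x = 0; x < XSIZE; x++)
            row = row (((x, y) in Maze) ? "#" : " ")
        print row
    }
}

BEGIN {
    srand()
    XSIZE = YSIZE = 11                       # initially, everything is wall
    for (y = 0; y < YSIZE; y++) for (x = 0; x < XSIZE; x++) Maze[x, y] = "#"
    delete Maze[0, 1]                        # entry
    delete Maze[XSIZE-1, YSIZE-2]            # exit
    carve(1, 1)
    show()
}
```

Whatever the random choices are, every element with two odd coordinates ends up free, and exactly one path connects any two of them.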
A mobile agent is a program that may be dispatched from a computer and transported to a remote server for execution. This is called migration and means that a process on another system is started that is independent from its originator. Ideally, it wanders through a network while working for its creator or owner. In places like the UMBC Agent Web, people are quite confident that (mobile) agents are a software engineering paradigm that will enable us to significantly increase the efficiency of our work. Mobile agents could become the mediators between users and the networking world. If you appreciate an unbiased view of this technology, you should have a look at the remarkable paper Mobile Agents: Are they a good idea ?.
In any case, it sounds interesting, so let us give it a try.
A good instance of this paradigm is
Agent Tcl,
an extension of the Tcl language. After introducing you to a typical
development environment, the aforementioned paper shows a nice little
example application that we will try to rebuild in GAWK. The
`who' agent takes a list of servers and wanders from one server
to the next one, always checking who is logged in. Having reached the last
one, it sends us a message with a list of all users it found on each
machine.
But before implementing something that might or might not be a mobile agent, let us clarify the concept and some important terms. The agent paradigm in general is such a young scientific discipline that it has not yet developed a widely accepted terminology. Some authors try to give precise definitions, but their scope is often not wide enough to be generally accepted. Franklin and Graesser ask Is it an Agent or just a Program: A Taxonomy for Autonomous Agents and give even better answers than Caglayan and Harrison in their Agent FAQ.
Before delving into the (rather demanding) details of implementation, let me give just one more quotation as a final motivation. Steven Farley published an excellent paper called Mobile Agent System Architecture, in which he asks Why use an agent architecture ?
If client-server systems are the currently established norm and distributed object systems such as CORBA are defining the future standards, why bother with agents? Agent architectures have certain advantages over these other types. Three of the most important advantages are:
1. An agent performs much processing at the server where local bandwidth is high, thus reducing the amount of network bandwidth consumed and increasing overall performance. In contrast, a CORBA client object with the equivalent functionality of a given agent must make repeated remote method calls to the server object because CORBA objects cannot move across the network at runtime.
2. An agent operates independently of the application from which the agent was invoked. The agent operates asynchronously, meaning that the client application does not need to wait for the results. This is especially important for mobile users who are not always connected to the network.
3. The use of agents allows for the injection of new functionality into a system at run time. An agent system essentially contains its own automatic software distribution mechanism. Since CORBA has no built-in support for mobile code, new functionality generally has to be installed manually.
Of course a non-agent system can exhibit these same features with some work. But the mobile code paradigm supports the transfer of executable code to a remote location for asynchronous execution from the start. An agent architecture should be considered for systems where the above features are primary requirements.
When trying to migrate a process from one system to the next, we need of course a server process on the receiving side. Depending on the kind of server process, several ways of implementation come to mind:
We will abuse a common web server as a migration tool. So, we need a
universal CGI script on the receiving side (the web server). It will be
activated with a POST request. Put it into a location like
`/httpd/cgi-bin/PostAgent.sh'. Make sure the server system uses a
version of GAWK that supports network access.
#!/bin/sh
MobAg=/tmp/MobileAgent.$$
cat > $MobAg # direct script to mobile agent file
gawk -f $MobAg $MobAg > /dev/null & # execute agent concurrently
gawk 'BEGIN{print "\r\nAgent started"}' # HTTP header, terminator and body
rm $MobAg # delete script file of agent
By making its process id $$ part of the unique file name, the
script avoids conflicts between concurrent instances of the script.
First, all lines
from standard input (the mobile agent's source code) are copied into
this unique file. Then, the agent is started as a concurrent process
and a short message reporting this fact is sent to the submitting client.
Finally, the script file of the mobile agent is removed because it is
no longer needed. Although a short script, there are several noteworthy
points about it:
GAWK is not the ideal language for such
a job. Lisp and Tcl are more suitable because they do not make a distinction
between program code and data.
The originating agent itself is started just like any other command line
script and reports the results on standard output.
But how can an agent that migrated to a host far away from its origin report
the result back home when there is no connection any more ? By letting the
name of the original host migrate with the agent. Having arrived at the end
of the journey, the agent establishes a connection and reports the results.
This is the reason for determining the name of the host with `uname -n'
and storing it in MyOrigin for later use.
We may also set variables with the `-v' option from the
command line. This interactivity is only of importance in the context of starting a mobile
agent, therefore this BEGIN pattern and its action will not take
part in migration.
BEGIN {
if (ARGC != 2) {
print "MOBAG - a simple mobile agent"
print "CALL:\n gawk -f mobag.awk mobag.awk"
print "IN:\n the name of this script as a command line parameter"
print "PARA:\n -vMyOrigin=myhost.com"
print "OUT:\n the result on stdout"
print "JK 29.03.98 01.04.98"
exit
}
if (MyOrigin == "") "uname -n" | getline MyOrigin
}
Since GAWK cannot manipulate and transmit parts of the program
directly, we have to read the source code and store it in strings.
Therefore, we scan it for the beginning and the ending of functions.
Each line in between will be appended to the code string until the end of
the function has been reached. A special case is this part of the program
itself. It is not a function. We put a similar frame around it to treat
it like a function. Notice that this mechanism will work for all the
functions of the source code, but it cannot guarantee that the order
of the functions will be preserved during migration.
#ReadMySelf
/^function / { FUNC = $2 }
/^END/ || /^#ReadMySelf/ { FUNC = $1 }
FUNC != "" { MOBFUN[FUNC] = MOBFUN[FUNC] RS $0 }
(FUNC!="") && (/^}/ || /^#EndOfMySelf/) { FUNC = "" }
#EndOfMySelf
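The same scanning logic can be tried on any source file. The sketch below reads the file with getline instead of pattern rules, which makes it easy to call from a test; the function name collect_functions() is hypothetical. Like the rules above, it uses the second word of the `function' line as the key and stops collecting at a line that starts with a closing brace.

```awk
# Sketch: gather the text of each function in an awk source file.
# collect_functions() is a hypothetical helper; the agent itself does the
# same job with pattern rules while reading its own source.
function collect_functions(file, funcs,    save_rs, line, name, w, n) {
    save_rs = RS; RS = "\n"
    name = ""
    while ((getline line < file) > 0) {
        if (line ~ /^function /) { split(line, w, " "); name = w[2]; n++ }
        if (name != "") funcs[name] = funcs[name] "\n" line
        if (name != "" && line ~ /^}/) name = ""   # end of this function
    }
    close(file)
    RS = save_rs
    return n + 0                                   # number of functions found
}
```

Because the result is an array, iterating over it with `for (i in funcs)' returns the functions in no particular order, which is the reason the order of functions is not preserved.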
When we built the web server code in the first chapter, we first developed
a site independent core. Likewise, we now build an agent independent core
that can be appended with application dependent functions.
Meanwhile, we have already reached the only application independent function
we need for the mobile agent. The function migrate() prepares the
aforementioned strings containing the program code and transmits them to a
server. A consequence of this modular approach is that the migrate()
function takes some parameters that we will not need in this application
but in future ones. Its mandatory parameter Destination holds the
name (or IP address) of the server that the agent wants as a host for its
code. The optional parameter MobCode may contain some GAWK
code that will be inserted during migration in front of all other code.
The optional parameter Label may contain
a string that tells the agent what to do in program execution after
arrival at its new home site. One of the serious obstacles in implementing
a framework for mobile agents is that it does not suffice to migrate the
code. We also have to migrate the state of execution of the agent. In
contrast to Agent Tcl, we do not try to migrate the complete set
of variables. We introduce the following convention.
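By default, a variable of the agent program is local to the current host
and does not migrate.
The array MOBFUN that we saw above is an exception. It is handled
by the function migrate() and will migrate with the application.
The other exception is the array MOBVAR. Each variable that shall
take part in migration has to be an element of this array.
migrate() will also take care of this.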
Now, you can understand what happens to the Label parameter of the
function migrate(). It is copied into MOBVAR["Label"] and
travels alongside the other data. Since travelling takes place via HTTP,
we have to separate records with "\r\n" in RS and
ORS as usual. The code assembly for migration takes place in
three steps:
1. Iterate over MOBFUN to collect all functions verbatim.
2. Prepare a BEGIN pattern and put assignments to mobile
variables into the action part.
3. Transmit the code in the same way as GETURL: the header with the request
and the Content-length is followed by the body. In case there is
any reply over the network, we read it completely and echo it to
standard output to avoid irritating the server.
function migrate (Destination, MobCode, Label) {
MOBVAR["Label"] = Label
MOBVAR["Destination"] = Destination
RS = ORS = "\r\n"
HttpService = "/inet/tcp/0/" Destination
for (i in MOBFUN) MobCode = MobCode "\n" MOBFUN[i]
MobCode = MobCode "\n\nBEGIN {"
for (i in MOBVAR) MobCode = MobCode "\n MOBVAR[\"" i "\"] = \"" MOBVAR[i] "\""
MobCode = MobCode "\n}\n"
print "POST /cgi-bin/PostAgent.sh HTTP/1.0" |& HttpService
print "Content-length:", length(MobCode) ORS |& HttpService
printf "%s", MobCode |& HttpService
while ((HttpService |& getline) > 0) print $0
}
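The second step deserves a closer look. migrate() writes the values of MOBVAR verbatim between double quotes, so a value that itself contains a double quote, a backslash or a newline would arrive as invalid code. The following sketch shows one way to escape such values before they are written; both function names, awk_quote() and begin_block(), are assumptions made for this illustration and do not appear in the agent.

```awk
# Sketch: build the BEGIN block that restores MOBVAR on the next host.
# awk_quote() and begin_block() are hypothetical helpers.
function awk_quote(s,    i, c, out) {
    for (i = 1; i <= length(s); i++) {
        c = substr(s, i, 1)
        if (c == "\n") { out = out "\\n"; continue }      # newline becomes \n
        if (c == "\\" || c == "\"") out = out "\\"        # escape \ and "
        out = out c
    }
    return "\"" out "\""                                  # add surrounding quotes
}

function begin_block(vars,    name, code) {
    code = "BEGIN {"
    for (name in vars)
        code = code "\n  MOBVAR[" awk_quote(name) "] = " awk_quote(vars[name])
    return code "\n}\n"
}
```

Values that consist of plain words, like the host list of this application, pass through unchanged apart from the added quotes.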
The application independent framework is now almost complete. What follows
is the END pattern that is executed when the mobile agent has
finished reading its own code. First, it checks whether it is already
running on a remote host or not. In case initialization has not yet taken
place, it starts MyInit(). Otherwise (later, on a remote host) it
starts MyJob().
END {
  if (ARGC != 2) exit    # stop when called with wrong parameters
  if (MyOrigin != "")    # is this the originating host?
    MyInit()             # then we initialize the application
  else                   # we are on a host with migrated data
    MyJob()              # so we do our job
}
All we have to do to extend the framework into a complete application
is to write two application specific functions MyInit() and
MyJob(). Keep in mind that the former is executed once on the
originating host, while the latter is executed after each migration.
function MyInit() {
  MOBVAR["MyOrigin"] = MyOrigin
  MOBVAR["Machines"] = "localhost/80 max/80 moritz/80 castor/80"
  split(MOBVAR["Machines"], Machines)            # which host is the first?
  migrate(Machines[1], "", "")                   # go to the first host
  while (("/inet/tcp/8080/0/0" |& getline) > 0)  # wait for result
    print $0                                     # print result
}
As mentioned earlier, this agent takes the name of its origin
(MyOrigin) with it. Then it takes the name of its first
destination and goes there for further work. Notice that each name in
the list has the port number of the web server appended to the name of
the server (as in `localhost/80'), because the function migrate()
builds the name of the TCP special file directly from it. Finally, the
agent's originating instance waits on port 8080 for the result to arrive.
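The following Python sketch shows why the port has to be part of each entry: the destination is pasted unchanged behind the `/inet/tcp/0/' prefix. The helper tcp_special_file is a hypothetical name introduced only for this illustration, and the check for a missing port is an addition that the gawk code does not perform.

```python
def tcp_special_file(destination, local_port=0):
    """Build the special file name for an outgoing TCP connection."""
    host, _, port = destination.partition("/")
    if not host or not port:
        # migrate() would silently build a useless file name here.
        raise ValueError("destination must look like host/port: %r" % destination)
    return "/inet/tcp/%d/%s/%s" % (local_port, host, port)


# Same itinerary as in MyInit(); the first entry is the first hop.
machines = "localhost/80 max/80 moritz/80 castor/80".split()
service = tcp_special_file(machines[0])
```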
function MyJob() {
  # forget this host (sub() treats the host name as a regular expression)
  sub(MOBVAR["Destination"], "", MOBVAR["Machines"])
  MOBVAR["Result"] = MOBVAR["Result"] SUBSEP SUBSEP MOBVAR["Destination"] ":"
  while (("who" | getline) > 0)                  # who is logged in?
    MOBVAR["Result"] = MOBVAR["Result"] SUBSEP $0
  close("who")
  if (index(MOBVAR["Machines"], "/") > 0) {      # any more machines to visit?
    split(MOBVAR["Machines"], Machines)          # which host is next?
    migrate(Machines[1], "", "")                 # go there
  } else {                                       # no more machines
    gsub(SUBSEP, "\n", MOBVAR["Result"])         # send result to origin
    print MOBVAR["Result"] |& ("/inet/tcp/0/" MOBVAR["MyOrigin"] "/8080")
  }
}
After migrating, the first thing to do in MyJob() is to delete
the name of the current host from the list of hosts to visit. Now, it
is time to start the real work by appending the host's name to the
result string and reading line by line who is logged in on this host.
An annoying restriction is that the elements of
MOBVAR cannot hold the newline character (`\n'). If they
did, migration of this string would not work because the generated
assignment would no longer be a valid string constant in GAWK. As a
replacement, we use SUBSEP temporarily. If the list of hosts to visit
holds at least one more entry, the agent migrates to that place to go on
working there. Otherwise, it is time to replace each SUBSEP
with a newline character in the resulting string and report it to
the originating host, whose name is stored in MOBVAR["MyOrigin"].
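The placeholder trick can be tried out in a few lines. The sketch below is written in Python for illustration and is not part of the agent; the function names pack, as_assignment and unpack are hypothetical. It assumes the default value of SUBSEP, the character "\034", and sample lines that contain no double quotes or backslashes.

```python
SUBSEP = "\034"  # default value of SUBSEP in awk


def pack(lines):
    """Join result lines with SUBSEP instead of newline."""
    return SUBSEP.join(lines)


def as_assignment(key, value):
    """Generate the assignment that travels inside the BEGIN block."""
    return 'MOBVAR["%s"] = "%s"' % (key, value)


def unpack(value):
    """Restore the newlines once the agent has finished travelling."""
    return value.replace(SUBSEP, "\n")


# Made-up sample lines in the style of the result string.
result = pack(["localhost/80:", "alice tty1", "bob tty2"])
statement = as_assignment("Result", result)
report = unpack(result)
```

Because statement contains no newline, it stays on a single line of the generated program, which is the property the agent depends on.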
This document was generated on 27 December 1998 using the texi2html translator version 1.51a.