Programming Languages
The most popular Web languages available
HTML
The files has the extension(s) – .html, .htm, .html4. HyperText Markup Language (HTML), invented by Tim Berners-Lee in 1989, is the undergirding framework of the Internet. Almost every Web site in existence uses the HTML language to display text, graphics, sounds, and animations. The caretaker of the specification is the World Wide Web Consortium (W3C) (http://www.w3.org). HTML language supports hyperlinking, or the ability to link from one word or area of a page to another area or page. HTML is made up of a series of “elements”, which tells the receiving browser to display certain elements on the screen.
List of HTML elements and their security implications are listed
HTML Element/Attribute | Security Implications |
<form> | Form for user input. Whenever a program accepts user input, a security risk exists. In fact, that is how most attacks occur—submission of characters to a program that isn’t expecting them, resulting in abnormal results. |
<form action> | Action attribute. This <form> attribute defines the executing program on the Web server. By knowing the name of the program processing the user-supplied data, an attacker can learn valuable information about the Web server and potentially find backup or older versions of the program in the same or other directories. |
<form method> | Method attribute. This <form> attribute defines the mechanism for sending user-supplied information to the Web server’s processing program. Two methods exist for submitting information to the program: POST and GET. By understanding the method of submission, an attacker can listen to the information (which may be sensitive in nature), or worse, alter the information being sent and produce abnormal results. |
<script language=<variable>> | It is used for scripting. The <script> element, used in conjunction with the “language” attribute, allows an attacker to modify any client-side scripting being sent to the browser. When an attacker can modify the client-side scripting, she can then bypass certain filtering or sanitization scripts. Client side scripting languages like Javascript, VBScript, etc. |
<input> | Input form control. The <input> element allows for an input control for a form. Specific attributes can be altered to send undesirable data to the Web server. |
<input type=hidden> | Type attribute. The “type” attribute, when assigned the value of “hidden,” can allow an attacker to change the “value” attribute to something undesirable. For example, some Web sites use a hidden attribute to store the price of an item in a shopping cart, which allows an attacker to change the price of that item manually to whatever she wants. If there is no server-side processing and validation of the price, an attacker can purchase items online at significantly reduced prices. |
<input maxlength=<variable>> | Maxlength attribute. The “maxlength” attribute can be altered by an attacker, causing it to submit large strings that can disable a Web server if not preprocessed appropriately. |
<input size=<variable>> | Size attribute. Similar to the “maxlength” attribute, the “size” attribute can be altered by an attacker causing it to submit large strings that can disable a Web server if not preprocessed appropriately. |
<applet> | Java applet. This element is used to display or run a Java applet. Because Java is transmitted in the clear and uses a known byte-code for execution, it can be seen on the wire by using a packet analyzer such as Snort or EtherPeek. For more information about decompiling Java applets and the <applet> tag. |
<object> | This element typically is used for displaying an ActiveX control, but it can also be used for Java applets. An attacker can send an e-mail with HTML embedded and have the reader execute an ActiveX control, which can take over the system. The <object> element is among the best ways to propagate an e-mail virus. |
<embed> | This element typically is used in conjunction with the <object> tag to display ActiveX controls and Netscape plug-ins. |
Dynamic HTML (DHTML)
It’s file extension is .dhtml. DHTML is often considered the object version of HTML. This language extends the HTML language to allow for increased control over page elements by allowing them to be accessed and modified by a scripting language such as Javascript or VBScript. As a result, an image tag may have an “OnMouseOver” event triggered when a user places his mouse over the image tag creating an object that comes to life with animation. Briefly, DHTML allows a Web developer to rapidly develop enhanced animations and effects such as mouseover effects, animated text, and dynamic color changes. Because DHTML is based on HTML, its security implications are similar to those for HTML.
XML
It’s file extension is .xml. A derivative of SGML, eXtensible Markup Language (XML) is less restrictive than HTML’s well-defined, standardized elements. XML allows anyone to create her own elements and extend the language herself. At the heart of this element extension are the Document Type Definitions (DTDs), which are similar in function to the data definition library of a relational database. A DTD defines the beginning and ending tags of an XML file, allowing the viewer to make sense of the data. For example, to define the data structure of a car dealership, we might have the following DTD:
<!ELEMENT Inventory (Car*)>
<!ELEMENT Car ( Make, Model, Color, Owner*)>
<!ELEMENT Make (#PCDATA)>
<!ELEMENT Model (#PCDATA)>
<!ELEMENT Color (#PCDATA)>
<!ELEMENT Owner (#PCDATA)>
The XML file displaying this data reads:
<?xml version=”1.0″ ?>
<!DOCTYPE Inventory PUBLIC “.” “Inventory.dtd” >
<Inventory>
<Car>
<Make>Honda</Make>
<Model>Civic</Model>
<Color>Red</Color>
<Owner>Jack Scanly</Owner>
<Owner>Jane Scanly</Owner>
</Car>
<Car>
<Make>Nissan</Make>
<Model>Maxima</Author>
<Color>Black</Color>
<Owner>Mike Smith</Owner>
</Car>
</Inventory>
The preceding code shows the current car inventory for a very small car dealership. Two cars populate the dataset, one red Honda Civic and one black Nissan Maxima. Note that the Civic has two owners in the database. The DTD specification allows for multiple owners by including an asterisk (*) next to the “Owner” name:
<!ELEMENT Car ( Make, Model, Color, Owner*)>
The preceding example gives an overview of the XML language and how it looks and functions—at least enough information so that you will know it when you see it. For the most part, the XML language is too new for us to cite specific security risks, but you should be prepared to identify the risks as the language matures and the hackers of the world begin to attack it.
XHTML
It’s file extension is .xhtml According to the W3C, HTML 4 is the final release of the ubiquitous language as we know it. The next version of HTML is being reformulated to include the XML language’s definition and structure. In other words, HTML and XML are being combined to form the XHTML language and give the W3C continued control over its design. The complete XHTML 1.1 specification can be found at http://www.w3.org/TR/xhtml11/. As with XML, XHTML is in its infancy, and its security hasn’t been tested to any significant degree.
Perl
It’s file extension is .pl or anything. The Practical Extraction and Report Language (Perl) is a high-level, (often considered a scripting) programming language written by Larry Wall in 1987. It is arguably the most ported scripting language to date with versions for AS/400, Windows 9x/NT/2000/XP, OS/2, Novell Netware, IBM’s MVS, Cray supercomputers, Digital’s VMS, Linux, Tandem, HP’s MPE/ix, MacOS, and all versions of Unix. The portability of the Perl language, combined with its low price (it’s free!) and robustness, has created a truly ubiquitous language for the Internet and is largely responsible for the Internet’s tremendous growth.
Perl is remarkably robust and flexible, because it can be written to accommodate server-side actions, scripted to perform functions locally on a system, or used to create entire standalone applications such as majordomo, the universal mail list manager. However, its primary use is handling the server-side scripting of Web sites. Security never has been a fundamental component of the language. As a result, many security vulnerabilities are present in Web sites that utilize the Perl language. But, if you use Perl there are ways to harden it.
Perl code can range from simple to overly complex in its design. To familiarize you with the look and feel of Perl, we present a script to run on the command line and display the words “We love Perl!” There are many ways to accomplish this task, but here is a simple one:
# Perl script to pass a message to the script
# and print it to the screen.
#
@parms = @ARGV;
$msg = $parms[0];
print “$msg!\n”;
Let’s break this code snippet down:
@parms = @ARGV;
This line takes in the parameters from the command line and inserts them into the @parms array:
$msg = $parms[0];
This line assigns the variable $msg to an element of the @parms array:
Print “$msg!\n”;
This line prints out the contents of the variable $msg and appends an exclamation mark and newline character to the screen. Running the program from the command line gives the following output:
C:\temp>perl love-perl.pl “We love Perl”
We love Perl!
C:\temp>
PHP
It’s file extensions are .php, .php3, and potentially no extension. Although there have been a number of PHP language authors, the origins of PHP start with Rasmus Lerdorf. He originally wrote the first PHP parsing engine in 1995 as a Perl CGI program, which he called “Personal Home Page,” or simply PHP. His original purpose was to log visitors to his résumé page on the Web. He later rewrote the whole thing in C and made it much larger, enhancing the program with greater parsing capabilities and adding database connectivity. Over the years, many other programmers have contributed to the development of PHP. To familiarize you with PHP code, we present the following code snippet, which uses the PHP command echo to display the string “Hello World!” in the user’s Web browser:
<!– PHP Example in HTML
<!– Prints “PHP Example: Hello World!” to the browser
<html><head><title> PHP Example</title></head>
<?php
echo “<br><h1>Hello World!<br></h1>”;
?>
</html>
PHP is much like Perl in that you can use the language in line with the HTML. Lines 1 and 2 are HTML comments, indicated by the <!– tags. Line 3 is strictly an HTML series of tags, including the <html>, <head>, and <title> tags. Lines 4, 5, and 6 are PHP code. Line 4 begins the PHP code, as indicated by the (<?) opening bracket. Line 5 is PHP’s code for outputting the string “Hello World!” on the screen. Line 6 is the (?>) closing bracket for the fourth line’s opening PHP bracket. The following is a simple example of PHP code and how you can quickly surmise the technologies in play on a Web site:
<?
// Open the SQL connection
$conn = mysql_connect(“10.1.1.1”, “sa”, “guessme”) or die(mysqlerror());
@mysql_select_db(“inventory”) or die(mysqlerror());
// SQL query
$data = mysql_query(“SELECT * FROM autos”) or die(mysqlerror());
// Print the data in HTML
print “<table>\n”;
while ($row = mysql_fetch_row ($data))
{
print “<tr>\n”;
print “<td>$row[0]</td>\n”;
print “<td>$row[1]</td>\n”;
print “</tr>\n”;
}
print “</table>\n”;
// Close the SQL connection
mysql_close($conn);
?>
CGI
It’s file extensions are .cgi, .pl. Common Gateway Interface (CGI) is one of the oldest and most mature standards on the Internet for passing information from a Web server to a program (such as Perl) and back to the Web browser in the proper format. Combined with languages such as Perl, CGI offered one of the first platforms for delivering server-side, dynamic content to the Web. Unlike ASP or PHP, CGI is not a language per se but rather a set of guidelines to be used for other languages. In fact, numerous languages can be used to create a CGI program, including Perl, C/C++, Java, Shell script language such as sh, csh, and ksh (Unix) and even Visual Basic (Windows). To help you become familiar with CGI code, we use the following trivial Perl/CGI program to display the message “Hello World!” in a Web browser:
# Perl Program to show general CGI functionality
#
# Prints “Hello World!” to the browser
print “Content-type:text/html\n\n”;
print “<html><br><head><title>Simple CGI Program</title></head><br>”;
print “<h1>Hello World!</h1><br><br></html>”;
At its heart, this code is Perl. The first print statement, using Content-type:text/html\n\n is typically required, because CGI doesn’t automatically send the HTTP headers for each request and consequently turns the responsibility for doing so over to the Perl script. Without the Content-type header, the CGI program doesn’t display the “Hello World!” text but typically displays an error message to the effect that:
The specified CGI application misbehaved by not returning a complete set of HTTP headers.
The second print statement simply sets up the HTML file with the <html>, <head>, and <title> tags. The final print statement prints the text “Hello World!” to the browser.
Environmental Variables – Finally, a discussion of CGI isn’t complete without covering environment variables. They are the conduit between the form being used by the Web browser and the back-end CGI program in that they allow certain information to pass to the running CGI program. The purpose of these variables is to better interpret the environment of the running browser.
Server-Side Includes (SSI)- HTML and SHTML
It’s file extensions are .shtml, .shtm, .stm SHTML is considered a CGI file because it typically uses server-side includes (SSI) for server-side processing. SHTML is considered the grandfather of server-side processing, because it has been around since the beginning of the Web and is still in use today (but sparingly). Like most of the languages we discussed so far, SSI are included in the HTML and get picked up by the application mappings of the Web server. The commands are acted on, with the output of the programs being sent to the browser. A simple example of an SSI is one that will return the server’s date, in GMT:
<html>
<body>
<!–#echo var=”DATE_GMT” –>
</body>
</html>
Java
It’s file extension is none. Originally called Oak, Java was written by James Gosling of Sun Microsystems in the early 1990s. Java was originally designed for smart consumer electronics devices. Even today, Java is an universal portable and functional language which has significant market. Java is an object-oriented language or one that treats all elements of a program as objects (e.g., variables, functions, subroutines, or the application itself). Even though Java is a large part of Web applications today, its source code isn’t disclosed—not easily anyway—to the client as in other Web languages. That’s because Java is compiled into class files that contain bytecode.
Client-Based Java – Client-based Java is designed to run in the processor space of the client system or end-user browser. Client-based Java code comes in two formats: applets and scripting languages.
Applets – It is indicated by the <applet> tag embedded in HTML, Java applets are downloaded and run by the client Web browser. One of the main risks with applets is that, although they are obfuscated by bytecode, the typical Java class can be downloaded separately and decompiled, allowing an attacker to view readable Java source code. This technique allows someone to search for security weaknesses in your Java code. Java applets can be written in any text file, as can most of the other languages. In fact, a simple Java applet can be written in minutes. For example, to write a “Hello World!” applet, copy the following code into a text file, naming it with a .java extension:
import java.awt.Graphics;
import java.applet.Applet;
public class HelloWorldApplet extends Applet
{
public void paint (Graphics g)
{
g.drawString(“Hello, World!”, 5, 30);
}
}
When Java is loaded into memory, it receives one stream for each method in the class. Then the bytecodes are executed by the CPU. Bytecode is a series of instructions, similar to a assembly or binary language, where each instruction consists of a one-byte opcode (the action) followed by zero or more operands. Because of bytecode, Java isn’t sent in the clear over the Internet but is obfuscated by this instruction code. Unfortunately, this obfuscation doesn’t hinder an attacker’s ability to decompile the bytecode to produce Java source code. And, with Java source code, an attacker can peer into the guts of your application, including getting a look any passwords used to connect to your servers or access your databases. Of the many Java decompilers available today, the most accurate is Java Decompiler (JAD) by Pavel Kouznetsov. In addition, the JAD engine is written in C++ making it incredibly fast.
JavaScript
JavaScript is a non-compiled, interpreted scripting language originally created by Netscape. The link between JavaScript and Java, however, is by name only. Java is compiled, object-oriented, and somewhat difficult to master. JavaScript is a simple, semi-object-oriented scripting language meant for rapid Web development. Although JavaScript has some basic object-oriented capabilities, it more closely resembles a scripting language, such as Perl. JavaScript is useful for simple tasks such as checking form data, adding HTML code on demand, and performing user specific computations such as date/time and browser considerations. A simple example of a JavaScript in action is the following code, which will display a popup box when a button is clicked:
<html>
<head>
<title>Simple JavaScript Example</title>
<script language=”Javascript”>
<!– hide for JavaScript challenged browsers
function popup()
{
alert(“Hello and welcome world!”);
}
// Done hiding –>
</script>
</head>
<h1 align=center>My JavaScipt example</h1>
<div align=center>
<form>
<input type=”button” value=”Hello World Me!” onclick=”popup()”>
</form>
This example is good for familiarizing you with the language, how it is called, and how to recognize it during your Web hacking days. Because JavaScript is client side, an attacker can bypass your input validation routines and input nonstandard data to potentially crash your application or, worse, get it to display sensitive information.
HTTP
Every Web browser and server must communicate over this protocol in order to exchange information. There have been three major versions of the protocol, all of which maintained the same fundamental structure. HTTP is a request/response stateless protocol that allows computers to talk to each other rather efficiently and carry on conversations lasting hours, days, and weeks at a time.
The standards development of HTTP was coordinated by the Internet Engineering Task Force (IETF) and the World Wide Web Consortium (W3C), culminating in the publication of a series of Requests for Comments (RFCs), most notably RFC 2616 (June 1999), which defined HTTP/1.1, the version of HTTP most commonly used today.
Technical Overview
HTTP functions as a request-response protocol in the client-server computing model. A web browser, for example, may be the client and an application running on a computer hosting a web site may be the server. The client submits an HTTP request message to the server. The server, which provides resources such as HTML files and other content, or performs other functions on behalf of the client, returns a response message to the client. The response contains completion status information about the request and may also contain requested content in its message body.
A web browser is an example of a user agent (UA). Other types of user agent include the indexing software used by search providers (web crawlers), voice browsers, mobile apps, and other software that accesses, consumes, or displays web content.
HTTP is designed to permit intermediate network elements to improve or enable communications between clients and servers. High-traffic websites often benefit from web cache servers that deliver content on behalf of upstream servers to improve response time. Web browsers cache previously accessed web resources and reuse them when possible to reduce network traffic. HTTP proxy servers at private network boundaries can facilitate communication for clients without a globally routable address, by relaying messages with external servers.
HTTP is an application layer protocol designed within the framework of the Internet Protocol Suite. Its definition presumes an underlying and reliable transport layer protocol, and Transmission Control Protocol (TCP) is commonly used. However HTTP can use unreliable protocols such as the User Datagram Protocol (UDP), for example in Simple Service Discovery Protocol (SSDP).
HTTP resources are identified and located on the network by Uniform Resource Identifiers (URIs)—or, more specifically, Uniform Resource Locators (URLs)—using the http or https URI schemes. URIs and hyperlinks in Hypertext Markup Language (HTML) documents form webs of inter-linked hypertext documents.
HTTP/1.1 is a revision of the original HTTP (HTTP/1.0). In HTTP/1.0 a separate connection to the same server is made for every resource request. HTTP/1.1 can reuse a connection multiple times to download images, scripts, stylesheets, etc after the page has been delivered. HTTP/1.1 communications therefore experience less latency as the establishment of TCP connections presents considerable overhead.
HTTP Request methods
HTTP defines methods (sometimes referred to as verbs) to indicate the desired action to be performed on the identified resource. What this resource represents, whether pre-existing data or data that is generated dynamically, depends on the implementation of the server. Often, the resource corresponds to a file or the output of an executable residing on the server. The HTTP/1.0 specification defined the GET, POST and HEAD methods and the HTTP/1.1 specification added 5 new methods: OPTIONS, PUT, DELETE, TRACE and CONNECT. By being specified in these documents their semantics are well known and can be depended upon. Any client can use any method and the server can be configured to support any combination of methods. If a method is unknown to an intermediate it will be treated as an unsafe and non-idempotent method. There is no limit to the number of methods that can be defined and this allows for future methods to be specified without breaking existing infrastructure. For example, WebDAV defined 7 new methods and RFC 5789 specified the PATCH method.
- GET – Requests a representation of the specified resource. Requests using GET should only retrieve data and should have no other effect. (This is also true of some other HTTP methods.) The W3C has published guidance principles on this distinction, saying, “Web application design should be informed by the above principles, but also by the relevant limitations.”
- HEAD – Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
- POST – Requests that the server accept the entity enclosed in the request as a new subordinate of the web resource identified by the URI. The data POSTed might be, for example, an annotation for existing resources; a message for a bulletin board, newsgroup, mailing list, or comment thread; a block of data that is the result of submitting a web form to a data-handling process; or an item to add to a database.
- PUT – Requests that the enclosed entity be stored under the supplied URI. If the URI refers to an already existing resource, it is modified; if the URI does not point to an existing resource, then the server can create the resource with that URI.
- DELETE – Deletes the specified resource.
- TRACE – Echoes back the received request so that a client can see what (if any) changes or additions have been made by intermediate servers.
- OPTIONS – Returns the HTTP methods that the server supports for the specified URL. This can be used to check the functionality of a web server by requesting ‘*’ instead of a specific resource.
- CONNECT – Converts the request connection to a transparent TCP/IP tunnel, usually to facilitate SSL-encrypted communication (HTTPS) through an unencrypted HTTP proxy.
Status codes
In HTTP/1.0 and since, the first line of the HTTP response is called the status line and includes a numeric status code (such as “404”) and a textual reason phrase (such as “Not Found”). The way the user agent handles the response primarily depends on the code and secondarily on the other response header fields. Custom status codes can be used since, if the user agent encounters a code it does not recognize, it can use the first digit of the code to determine the general class of the response.
The standard reason phrases are only recommendations and can be replaced with “local equivalents” at the web developer’s discretion. If the status code indicated a problem, the user agent might display the reason phrase to the user to provide further information about the nature of the problem. The standard also allows the user agent to attempt to interpret the reason phrase, though this might be unwise since the standard explicitly specifies that status codes are machine-readable and reason phrases are human-readable. HTTP status code is primarily divided into five groups for better explanation of request and responses between client and server as named: Informational 1XX, Successful 2XX, Redirection 3XX, Client Error 4XX and Server Error 5XX.
HTTP Session
An HTTP session is a sequence of network request-response transactions. An HTTP client initiates a request by establishing a Transmission Control Protocol (TCP) connection to a particular port on a server (typically port 80, occasionally port 8080). An HTTP server listening on that port waits for a client’s request message. Upon receiving the request, the server sends back a status line, such as “HTTP/1.1 200 OK”, and a message of its own. The body of this message is typically the requested resource, although an error message or other information may also be returned.
The request response model of HTTP works as per the model below
HTTPS (HTTP over SSL)
HTTPS is a protocol used for encrypted traffic within an HTTP stream. The entire message is encrypted when Secure Sockets Layer (SSL) is used. Many versions of SSL and its related protocols (Transport Layer Security, TLS, and RFC2246) are available, including SSLv1, SSLv2, and SSLv3. And to make things even more confusing, SSL offers a variety of choices for the encryption standard used within a particular version of SSL. For example, with SSLv3, you can choose from DES to RSA (RC2 and RC4).
HTTPS provides authentication of the website and associated web server that one is communicating with, which protects against man-in-the-middle attacks. Additionally, it provides bidirectional encryption of communications between a client and server, which protects against eavesdropping and tampering with and/or forging the contents of the communication. In practice, this provides a reasonable guarantee that one is communicating with precisely the website that one intended to communicate with (as opposed to an impostor), as well as ensuring that the contents of communications between the user and site cannot be read or forged by any third party.
Historically, HTTPS connections were primarily used for payment transactions on the World Wide Web, e-mail and for sensitive transactions in corporate information systems. In the late 2000s and early 2010s, HTTPS began to see widespread use for protecting page authenticity on all types of websites, securing accounts and keeping user communications, identity and web browsing private.
URL
A Uniform Resource Locator (URL) is an abstract identification that locates a resource on a Web server. A URL contains the following information:
- Protocol – Specifies the Internet protocol to access a resource. The abstract encompasses FTP, Gopher, and HTTP Internet protocols.
- Network Endpoint – Internet address of Internet Information Server and protocol port number
- Resource Location – Path information to locate a resource on Internet Information Server or web server.
URL syntax is
{service}:://{host}[:port]/[path/…/] [file name]
Required parameters are surrounded by {}. Optional parameters are surrounded by []. Other characters are mandatory separators.
- Service is a required field. Web servers support FTP, Gopher, and HTTP services.
- Host is a required field. This field is the host name or IP address of the Internet Information Server.
- Port is an optional field. This field is an abstraction used by the network and transport layers to select a service on the server. This field is not frequently used. It may be specified if the service is available on a nonstandard protocol port number.
- Path is an optional field. It specifies URL resource location. A path without a file name following must always end with a / character.
The combination of host and port is a network endpoint. An example of a URL is
http://www.infomax.com:8080/welcome.htm
The http: component is the service. The http://www.infomax.com:8080 component is the network endpoint. The /welcome.htm component is the resource location.