URL stands for Uniform Resource Locator, the global 
address of documents and other resources on the World Wide Web. The 
first part of the address is called a protocol identifier and it 
indicates what protocol to use, and the second part is called a resource
 name and it specifies the IP address or the domain name where the 
resource is located. The protocol identifier and the resource name are 
separated by a colon and two forward slashes.
For example:
http://www.tallysolutions.com/website/html/PartnerDetails/622894.php
The URL above specifies a web page that should be fetched using the HTTP protocol.
Elements of a URL
Every URL is made up of some combination of the 
following: the scheme name (commonly called protocol), followed by a 
colon, then, depending on scheme, a hostname (alternatively, IP 
address), a port number, the pathname of the file to be fetched or the 
program to be run, then (for programs such as CGI scripts) a query 
string[4][5], and with HTML files, an anchor (optional) for where the 
page should start to be displayed.
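As a rough illustration of these elements, the sketch below uses Python's standard urllib.parse module to split a URL into its parts. The URL itself is made up for illustration: the port 8080, the file name page.php, the query string id=622894 and the anchor top are assumptions, not taken from a real page.

```python
from urllib.parse import urlsplit

# Hypothetical URL that contains every element described above.
url = "http://www.tallysolutions.com:8080/website/html/page.php?id=622894#top"

parts = urlsplit(url)

print(parts.scheme)    # scheme name (protocol)
print(parts.hostname)  # hostname or IP address
print(parts.port)      # port number as an int, or None if absent
print(parts.path)      # pathname of the file or program
print(parts.query)     # query string, without the leading "?"
print(parts.fragment)  # anchor, without the leading "#"
```

Any element that is missing from the URL comes back as an empty string (or None for the port), which matches the idea that a URL is "some combination" of these parts.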
Scheme
The scheme represents the protocol, and for our 
purposes will either be http or https. https represents a connection to a
 secure web server.
<scheme>:<scheme-specific-part>
A URL contains the name of the scheme being used
 (<scheme>) followed by a colon and then a string (the 
<scheme-specific-part>) whose interpretation depends on the 
scheme. Scheme names consist of a sequence of characters. The lower case
 letters "a"--"z", digits, and the characters plus ("+"), period ("."), 
and hyphen ("-") are allowed. For resiliency, programs interpreting URLs
 should treat upper case letters as equivalent to lower case in scheme 
names (e.g., allow "HTTP" as well as "http"). 
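The scheme rules above can be sketched in a few lines of Python. The function name normalize_scheme is hypothetical, and the sketch assumes that upper case input is simply folded to lower case before the allowed characters are checked.

```python
import re

# Lower case letters, digits, "+", "." and "-", as listed in the text above.
SCHEME_CHARS = re.compile(r"[a-z0-9+.-]+")

def normalize_scheme(url):
    """Return the lower-cased scheme of url, or None if it has no valid scheme."""
    scheme, colon, _rest = url.partition(":")
    if not colon:
        return None  # no colon, so there is no <scheme> part at all
    scheme = scheme.lower()  # treat "HTTP" the same as "http"
    return scheme if SCHEME_CHARS.fullmatch(scheme) else None

print(normalize_scheme("HTTP://www.tallysolutions.com/"))
print(normalize_scheme("no scheme here"))
```

Everything after the first colon is the scheme-specific part, which this sketch deliberately leaves uninterpreted.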
Host 
The hostname part of the URL should be a valid 
Internet hostname such as www.tallysolutions.com. It can also be an IP 
address such as 204.29.207.217.
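To tell the two kinds of host apart, one possible approach is to try parsing the host as an IP address and fall back to treating it as a hostname. The helper name host_kind is an assumption made for this sketch; it does not check whether a hostname is actually valid or resolvable.

```python
import ipaddress
from urllib.parse import urlsplit

def host_kind(url):
    """Classify the host part of url as "ip", "hostname", or None if missing."""
    host = urlsplit(url).hostname
    if host is None:
        return None
    try:
        ipaddress.ip_address(host)  # raises ValueError if host is not an IP
        return "ip"
    except ValueError:
        return "hostname"

print(host_kind("http://www.tallysolutions.com/"))
print(host_kind("http://204.29.207.217/"))
```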
Port Number
The port number is optional. It's not necessary if the service is running on the default port, which is 80 for http servers and 443 for https servers.
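A small sketch of how a client might work out which port to connect to when the URL omits it. The function name effective_port and the DEFAULT_PORTS table are hypothetical; 443 is the standard default for https and is included here alongside the port 80 mentioned above.

```python
from urllib.parse import urlsplit

# Default ports per scheme (only the two schemes this article deals with).
DEFAULT_PORTS = {"http": 80, "https": 443}

def effective_port(url):
    """Return the explicit port if the URL has one, else the scheme default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)  # None for unknown schemes

print(effective_port("http://www.tallysolutions.com/"))
print(effective_port("http://www.tallysolutions.com:8080/"))
```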
Path Information
The path points to a particular directory on the 
specified server. The path is relative to the document root of the 
server, not necessarily to the root of the file system on the server. In
 general a server does not show its entire file system to clients. 
Indeed it may not really expose a file system at all. (Amazon's URLs, 
for example, mostly point into a database.) Rather it shows only the 
contents of a specified directory. This directory is called the server 
root, and all paths and filenames are relative to it. Thus on a Unix 
workstation all files that are available to the public might be in 
/var/public/html, but to somebody connecting from a remote machine this 
directory looks like the root of the file system.
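For servers that do expose files, the mapping from a URL path to a location under the server root might look roughly like the sketch below. The name resolve_path is hypothetical, and /var/public/html is simply the example directory from the paragraph above. Real servers do considerably more (aliases, access rules, symbolic links), so treat this as an illustration only.

```python
import posixpath

# Example server root taken from the paragraph above.
DOCUMENT_ROOT = "/var/public/html"

def resolve_path(url_path):
    """Map a URL path to a file system path underneath DOCUMENT_ROOT."""
    # Normalizing against "/" collapses "." and ".." segments, so a request
    # cannot climb above what the client sees as the root.
    normalized = posixpath.normpath("/" + url_path.lstrip("/"))
    return posixpath.join(DOCUMENT_ROOT, normalized.lstrip("/"))

print(resolve_path("/website/html/index.html"))
print(resolve_path("/../../etc/passwd"))
```

The second call shows why the remote client never sees the real file system root: even a path full of ".." segments stays inside the server root.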
The filename points to a particular file in the 
directory specified by the path. It is often omitted, in which case it is
 left to the server's discretion what file, if any, to send. Many 
servers will send an index file for that directory, often called 
index.html. Others will send a list of the files in the directory. 
Others may send an error message.
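The index file behaviour can be sketched as follows. This is a hypothetical model, not how any particular server is implemented: file_to_serve is an invented name, the set of existing files stands in for the real file system, and index.html is assumed to be the configured index file.

```python
import posixpath

# Assumption: this imaginary server uses "index.html" as its index file.
INDEX_FILE = "index.html"

def file_to_serve(url_path, existing_files):
    """Pick the file for url_path, given the set of paths that exist.

    Returns None when nothing matches, leaving the server free to send
    a directory listing or an error message instead.
    """
    if not url_path.endswith("/"):
        return url_path if url_path in existing_files else None
    # The filename was omitted, so look for the index file in that directory.
    candidate = posixpath.join(url_path, INDEX_FILE)
    return candidate if candidate in existing_files else None

files = {"/index.html", "/docs/guide.html"}
print(file_to_serve("/", files))
print(file_to_serve("/docs/", files))
```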
Fragment identifier
The fragment identifier is used to reference a named anchor or ID in an HTML document. A named anchor is created in an HTML document with an A element that has a NAME attribute, like this one:
<a name="anchor">Here is the content you're after...</a>
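On the client side, the fragment is the part of the URL after the "#" character. A minimal sketch using urldefrag from Python's standard library is shown below; the page URL is invented for illustration and simply reuses the anchor name from the HTML example.

```python
from urllib.parse import urldefrag

# Hypothetical URL pointing at the named anchor from the example above.
url = "http://www.tallysolutions.com/website/html/page.html#anchor"

# urldefrag splits the URL into the part before "#" and the fragment itself.
base, fragment = urldefrag(url)

print(base)      # the URL without the fragment
print(fragment)  # the anchor name, without the leading "#"
```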
Absolute and Relative URLs
Absolute URL
URLs that include the scheme and the hostname are called absolute URLs. An example of an absolute URL is:
http://localhost/cgi/script.cgi.
Relative URL 
URLs without a scheme, host, or port are called relative URLs. These can be further broken down into full and relative paths:
Full paths
Relative URLs with an absolute path are sometimes
 referred to as full paths (even though they can also include a query 
string and fragment identifier). Full paths can be distinguished from 
URLs with relative paths because they always start with a forward slash.
 Note that in all these cases, the paths are virtual paths, and do not 
necessarily correspond to a path on the web server's filesystem. An 
example of an absolute path is /index.html.
Relative paths
Relative URLs that begin with a character other 
than a forward slash are relative paths. Examples of relative paths 
include script.cgi and ../images/photo.jpg.
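To see how relative URLs are turned back into absolute ones, the sketch below resolves the examples from this section against the absolute URL given earlier, using urljoin from Python's standard library. The file name other.cgi is made up for the example.

```python
from urllib.parse import urljoin

# The absolute URL from the example above acts as the base.
base = "http://localhost/cgi/script.cgi"

# A full path replaces the whole path of the base URL.
full_path = urljoin(base, "/index.html")

# A relative path is resolved against the directory of the base URL.
sibling = urljoin(base, "other.cgi")
parent = urljoin(base, "../images/photo.jpg")

print(full_path)  # http://localhost/index.html
print(sibling)    # http://localhost/cgi/other.cgi
print(parent)     # http://localhost/images/photo.jpg
```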
URL Character Encoding Issues
URLs are sequences of characters, i.e., letters, 
digits, and special characters. A URL may be represented in a variety 
of ways: e.g., ink on paper, or a sequence of octets in a coded 
character set. The interpretation of a URL depends only on the identity 
of the characters used.
In most URL schemes, the sequences of characters 
in different parts of a URL are used to represent sequences of octets 
used in Internet protocols. For example, in the ftp scheme, the host 
name, directory name and file names are such sequences of octets, 
represented by parts of the URL. Within those parts, an octet may be 
represented by the character which has that octet as its code within 
the US-ASCII [20] coded character set.
In addition, octets may be encoded by a character
 triplet consisting of the character "%" followed by the two hexadecimal
 digits (from "0123456789ABCDEF") which form the hexadecimal value of
 the octet. (The characters "abcdef" may also be used in hexadecimal 
encodings.)
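The triplet encoding is simple enough to write by hand. The two helper names below, encode_octet and decode_triplet, are hypothetical and only illustrate the rule for a single octet.

```python
def encode_octet(octet):
    """Return the "%XX" triplet for a single octet (an int from 0 to 255)."""
    if not 0 <= octet <= 255:
        raise ValueError("octet must be in the range 0-255")
    return "%{:02X}".format(octet)

def decode_triplet(triplet):
    """Return the octet for a "%XX" triplet; "abcdef" and "ABCDEF" both work."""
    if len(triplet) != 3 or triplet[0] != "%":
        raise ValueError("expected a triplet of the form %XX")
    return int(triplet[1:], 16)

print(encode_octet(0x20))      # the space character
print(decode_triplet("%2f"))   # lower case hexadecimal digits are accepted
```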
Octets must be encoded if they have no 
corresponding graphic character within the US-ASCII coded character set,
 if the use of the corresponding character is unsafe, or if the 
corresponding character is reserved for some other interpretation within
 the particular URL scheme.
No corresponding graphic US-ASCII
URLs are written only with the graphic printable 
characters of the US-ASCII coded character set. The octets 80-FF 
hexadecimal are not used in US-ASCII, and the octets 00-1F and 7F 
hexadecimal represent control characters; these must be encoded.
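Python's urllib.parse.quote applies this rule for you. One assumption worth stating: characters outside US-ASCII are first turned into octets using UTF-8, which is quote's default; the text above does not say which character encoding to use.

```python
from urllib.parse import quote

# "é" is not in US-ASCII; in UTF-8 it is the two octets C3 and A9.
non_ascii = quote("café")

# DEL (7F) and newline (0A) are control characters.
with_del = quote("a\x7fb")
with_newline = quote("line\n")

print(non_ascii)     # caf%C3%A9
print(with_del)      # a%7Fb
print(with_newline)  # line%0A
```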
Unsafe
Characters can be unsafe for a number of reasons.
 The space character is unsafe because significant spaces may disappear 
and insignificant spaces may be introduced when URLs are transcribed or 
typeset or subjected to the treatment of word-processing programs. The 
characters < and > are unsafe because they are used as the 
delimiters around URLs in free text; the quote mark (") is used to 
delimit URLs in some systems. The character "#" is unsafe and should 
always be encoded because it is used in the World Wide Web and in other 
systems to delimit a URL from a fragment/anchor identifier that might 
follow it. The character "%" is unsafe because it is used for encodings 
of other characters. Other characters are unsafe because gateways and 
other transport agents are known to sometimes modify such characters. 
These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded 
within a URL. For example, the character "#" must be encoded within URLs
 even in systems that do not normally deal with fragment or anchor 
identifiers, so that if the URL is copied into another system that does 
use them, it will not be necessary to change the URL encoding.
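As an illustration, the list of unsafe characters above can be turned into a small encoder. The name encode_unsafe is hypothetical, and the sketch assumes the input is plain US-ASCII text that has not been encoded yet.

```python
# The unsafe characters named in the text above, including the space.
UNSAFE = set(' <>"#%{}|\\^~[]`')

def encode_unsafe(text):
    """Percent-encode every character the text above lists as unsafe."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE else ch
        for ch in text
    )

print(encode_unsafe("my file#1.html"))  # my%20file%231.html
print(encode_unsafe("100%"))            # 100%25
```

Because "%" is itself unsafe, running this over text that is already encoded would encode it a second time, so it should be applied exactly once to raw text.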
Reserved
Many URL schemes reserve certain characters for a
 special meaning: their appearance in the scheme-specific part of the 
URL has a designated semantics. If the character corresponding to an 
octet is reserved in a scheme, the octet must be encoded. The characters
 ";", "/", "?", ":", "@", "=" and "&" are the characters which may 
be reserved for special meaning within a scheme. No other characters may
 be reserved within a scheme.
Usually a URL has the same interpretation when an
 octet is represented by a character and when it is encoded. However, this 
is not true for reserved characters: encoding a character reserved for a
 particular scheme may change the semantics of a URL.
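The effect of encoding a reserved character can be seen with "/" in an http path. In the sketch below, both URLs are invented for the example; the only difference between them is that one slash has been written as %2F.

```python
from urllib.parse import urlsplit, unquote

plain = urlsplit("http://localhost/reports/2023/summary.html")
encoded = urlsplit("http://localhost/reports%2F2023/summary.html")

# Split each path on the reserved "/" first, then decode each segment.
plain_segments = [unquote(s) for s in plain.path.split("/") if s]
encoded_segments = [unquote(s) for s in encoded.path.split("/") if s]

print(plain_segments)    # three segments
print(encoded_segments)  # two segments; the first contains a literal "/"
```

The unencoded slash acts as a separator between path segments, while the encoded one is just data inside a single segment, so the two URLs do not mean the same thing.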
Thus, only alphanumerics, the special characters 
"$-_.+!*'(),", and reserved characters used for their reserved purposes 
may be used unencoded within a URL. On the other hand, characters that 
are not required to be encoded (including alphanumerics) may be encoded 
within the scheme-specific part of a URL, as long as they are not being 
used for a reserved purpose. 
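Putting the rule together, a rough allow-list encoder might look like this. The function encode_component and its reserved_ok parameter are hypothetical names; the caller passes in any reserved characters that are being used for their reserved purpose. As an assumption, non-ASCII text is converted to octets with UTF-8.

```python
import string

# Alphanumerics plus the special characters listed in the text above.
ALWAYS_ALLOWED = set(string.ascii_letters + string.digits + "$-_.+!*'(),")

def encode_component(text, reserved_ok=""):
    """Encode every octet except allowed characters and those in reserved_ok."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        if ch in ALWAYS_ALLOWED or ch in reserved_ok:
            out.append(ch)  # may appear unencoded
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)

print(encode_component("a b&c"))                                # a%20b%26c
print(encode_component("/docs/my file.html", reserved_ok="/"))  # /docs/my%20file.html
```

In the second call "/" is kept as-is because it is doing its reserved job of separating path segments; in the first call "&" is encoded because it is plain data.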
 
 