Elite News

Tuesday, April 04, 2006

Brian's Guide to Creating a Website - Part 1

Part 1 - Search Engines
Starting a blog, or even a website for that matter, isn't what it used to be. Back in the days of Geocities, Angelfire, FrontPage, and Netscape Editor, things were so simple. I started my first website in 8th grade, back when HTML was all you needed to know, plus maybe some JavaScript and CSS. In 30 minutes, I was able to get my makeshift site, with more pictures than text, up and into search engines.

Nowadays, times have changed: new technology, new processes, new everything. I am learning a lot as I continually try to improve this blog, both for viewers and to get it known. So far, I have only had experience programming websites and utilities; I am new to the actual administration of one. So, my goal is to teach what I learn, as I learn it. Maybe I can give some viewers a head start with their own websites.

Next week, I'll go into actual design of a website and some neat tools you can use and add to it. For now I'm going to cover search engines.

There are still some legacy search engines that use meta tags to find and describe your site. So before you even submit your site to any search engine, first create suitable meta tags.

Here is our keywords meta tag, for example:
<meta name="keywords" content="security, technology, news, computer, it, information technology, blog, tech">
name defines the type of meta tag; in this case it's "keywords."
content defines the value for that type.
In this example, I'm making "security, techn..." the searchable keywords for my site. Here is another meta tag that you need:
<meta name="description" content="Security. Technology. News. Now. Up to date information focusing on Security and Technology.">
This works basically the same way, except that it gives my site a description.
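If you want to double-check what a crawler would read out of these tags, here is a minimal sketch in Python (my choice of language for illustration; nothing in the tags requires it). It uses the standard library's html.parser to pull out the two meta tags above. The class name MetaTagReader, the read_meta helper, and the tiny PAGE wrapper around the tags are all made up for this example.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content is not None:
            # Meta names are case-insensitive, so store them lowercased.
            self.meta[name.lower()] = content


def read_meta(html):
    reader = MetaTagReader()
    reader.feed(html)
    return reader.meta


# The two tags from this post, wrapped in a made-up minimal page.
PAGE = """<html><head>
<meta name="keywords" content="security, technology, news, computer, it, information technology, blog, tech">
<meta name="description" content="Security. Technology. News. Now. Up to date information focusing on Security and Technology.">
</head><body></body></html>"""

meta = read_meta(PAGE)
keywords = [k.strip() for k in meta["keywords"].split(",")]
print(keywords)
print(meta["description"])
```

Running this against your own page is a quick way to catch a typo in a name or content attribute before you submit the site anywhere.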

Now on to more advanced search engines, easy one first: Yahoo. Yahoo uses the old style of site submission. You just go to http://search.yahoo.com/info/submit.html and sign up for one of their services. Simple.

Google. Google, like many search engines, uses a web crawler. A crawler systematically searches up and down your site, through every link it finds, to gather useful information, produce a description, and place your site in its search results. As with Yahoo, you submit your site at http://www.google.com/submit_content.html and sign up. The cool part of Google is that you can use its utilities to get more control over your site and monitor it more closely.

Here's where it gets complicated. To get the most out of Google (and other search engines by association), I had to research how it works.

Googlebot (Google's website crawler robot) starts with your submitted link and searches that page for more links. It then goes through each of those links to discover more. The whole time, it's gathering data for its relevance engine and its description engine. Along the way it also ends up in whatever folders of your site those links lead into.
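To make the "follow links to find more links" idea concrete, here is a rough sketch in Python of a breadth-first crawl. It is only my illustration of the general technique, not how Googlebot is actually built. To keep it runnable without a network, the "site" is a made-up dictionary of pages under the invented address http://example.test/, and the names LinkCollector, crawl, and SITE are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gathers the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Breadth-first walk from start_url; fetch(url) returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page does not exist, nothing to read
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A made-up four-page site standing in for real HTTP requests.
SITE = {
    "http://example.test/": '<a href="/about.html">About</a> <a href="/posts/">Posts</a>',
    "http://example.test/about.html": '<a href="/">Home</a>',
    "http://example.test/posts/": '<a href="first.html">First</a>',
    "http://example.test/posts/first.html": "<p>No links here.</p>",
}

visited = crawl("http://example.test/", SITE.get)
print(visited)
```

The seen set is what keeps the crawl from looping forever when pages link back to each other, as the About page does here.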

Security thought: What if you don't want Google to see your whole site? What if you don't want your whole site to be accessed through a search engine? I certainly don't want my random storage data files or test websites to be seen. So how do you stop Googlebot!?

Google, like many other crawlers such as Excite, abides by the robots.txt rules (http://www.robotstxt.org/wc/robots.html). robots.txt is a small file in your root folder that tells web crawlers which folders they may and may not access. It must be written in a specific format and be placed in the root directory (e.g. http://www.elitenews.org/robots.txt) for it to work.

# <- comment symbol
User-Agent: *
# User-Agent defines which crawler this rule applies to
# * = any crawler, a.k.a. wildcard
# a specific crawler's name can also be put here, like Googlebot, or w3crawler, or happycrawler, etc.
Disallow: /secretstuff
# stops the specified folder from being accessed
# listed before Allow because some crawlers stop at the first rule that matches
Allow: /
# allows the rest of the site, starting at the root dir, to be searched
# you can have many of these 'modules' in one robots.txt
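You can test rules like these before uploading them. Below is a small sketch using Python's standard urllib.robotparser module; the is_allowed helper and the notes.txt file name are made up for the example. One caveat I'm confident about: Python's parser applies the first rule that matches, so in this copy of the rules I've put the Disallow line ahead of Allow. Other crawlers may resolve conflicts differently, so treat this as a sanity check rather than a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Same rules as in this post, with Disallow listed first because
# Python's parser uses the first rule that matches a path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secretstuff
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# notes.txt is an invented file name, just for the check.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.elitenews.org/secretstuff/notes.txt"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://www.elitenews.org/index.html"))
```

The first check should come back False and the second True. If they don't, the rules are not saying what you think they say.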


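Security thought: Nothing stops a normal person from seeing this information, and the file only asks crawlers to stay away; it doesn't block anyone from requesting those folders directly. If, say, a hacker were trying to examine the structure of your site, they could find a lot of useful information in this file. See for yourself; visit: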
http://www.google.com/robots.txt
http://www.microsoft.com/robots.txt
http://www.apple.com/robots.txt
http://www.cnet.com/robots.txt
http://www.cox.net/robots.txt
The list goes on.

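Say you don't want Googlebot to follow the links on a single page. Put
<meta name="Googlebot" content="nofollow">
in the <head> section of that page and it will prevent Googlebot from following the links on the page. (If the goal is to keep the page itself out of the search results, the content value to use is "noindex" instead.)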

What if it's just one link you don't want it to consider? Change your link to:

<a href="http://www.blah.com/" rel="nofollow">unrelated to my site</a>
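Here is how I picture a polite crawler applying both kinds of nofollow, as a hedged sketch in Python. It is my own illustration of the idea, not Googlebot's real logic. The class FollowableLinks, the function links_to_follow, and the two sample pages (including the invented /archive.html link) are hypothetical; I also treat the generic meta name "robots" the same way as "Googlebot", which is an assumption on my part for the example.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Records links plus any page-wide or per-link nofollow hints."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.candidates = []  # list of (href, has_rel_nofollow)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = (attrs.get("content") or "").lower()
            tokens = [t.strip() for t in content.split(",")]
            if name in ("robots", "googlebot") and "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            self.candidates.append((attrs["href"], "nofollow" in rel))


def links_to_follow(html):
    parser = FollowableLinks()
    parser.feed(html)
    if parser.page_nofollow:
        return []  # the meta tag switches off every link on the page
    return [href for href, nofollow in parser.candidates if not nofollow]


# Made-up sample pages for the check.
PER_LINK = (
    '<a href="/archive.html">archive</a> '
    '<a href="http://www.blah.com/" rel="nofollow">unrelated to my site</a>'
)
WHOLE_PAGE = '<meta name="Googlebot" content="nofollow">' + PER_LINK

print(links_to_follow(PER_LINK))
print(links_to_follow(WHOLE_PAGE))
```

The per-link attribute drops only the one link, while the meta tag drops every link on the page, which is the difference between the two snippets above.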

To help the crawler out and to make sure it discovers all of the pages you want it to, you should also make a sitemap (https://www.google.com/webmasters/sitemaps/login?hl=en). Google is nice enough to monitor your sitemap and check it for errors, and your site along with it. A sitemap is basically a link guide to the rest of your site. It includes all the links to the various pages of your site.

There are a few formats you can use. The two I am most familiar with are XML and a simple TXT file. Google gives an awesome example of how to use XML for your sitemap, and if you don't want to make it yourself, they offer a program that does it for you. Personally, I like doing everything myself manually. https://www.google.com/webmasters/sitemaps/docs/en/protocol.html

If you don't want to deal with that mess, a simple text file will do. The only catch is that you cannot set any of the other variables that the XML version offers. To make one, just put a single link per line in a .txt file.
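For anyone who would rather generate either format than type it by hand, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. A few assumptions to flag: the namespace below is the one published at sitemaps.org, which may differ from the schema version in Google's protocol page linked above; changefreq is just one of the optional variables; and the about.html page and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org namespace, not necessarily the one in Google's docs.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_xml_sitemap(pages):
    """pages is a list of (url, changefreq or None) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if changefreq:
            ET.SubElement(url, "changefreq").text = changefreq
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def build_text_sitemap(pages):
    """The plain-text version: one link per line, no extra variables."""
    return "\n".join(loc for loc, _ in pages) + "\n"


# about.html is a hypothetical page, used only for illustration.
PAGES = [
    ("http://www.elitenews.org/", "daily"),
    ("http://www.elitenews.org/about.html", None),
]

xml_sitemap = build_xml_sitemap(PAGES)
text_sitemap = build_text_sitemap(PAGES)
print(xml_sitemap)
print(text_sitemap)
```

Both functions take the same list of pages, so you can start with the text file and switch to XML later once you want the extra variables.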

Still having trouble? Visit http://sitemaps.blogspot.com/ ; it helped me out a lot with my errors.

I know it's long, but I hope it helps.

1 Comment:

  • this is a great start! very informative, but did they have CSS back when you were in 8th grade?

    By Anonymous, at 4/05/2006 11:02 PM


