
Data Structures with .NET - Part 1: An Introduction to Data Structures

Source: 巧巧读书, 2006-05-22

An Extensive Examination of Data Structures

Part 1: An Introduction to Data Structures

Scott Mitchell
4GuysFromRolla.com

October 2003

Summary: This article kicks off a six-part series that focuses on important data structures and their use in application development. We'll examine both built-in data structures present in the .NET Framework, as well as essential data structures we'll have to build ourselves. This first installment focuses on defining what data structures are, how the efficiency of data structures is analyzed, and why this analysis is important. In this article, we'll also examine the Array and ArrayList, two of the most commonly used data structures present in the .NET Framework. (12 printed pages)

Contents

Introduction
Analyzing the Performance of Data Structures
Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure
The ArrayList: a Heterogeneous, Self-Redimensioning Array
Conclusion

Introduction

Welcome to the first in a six-part series on using data structures in .NET. Throughout this article series we will be examining a variety of data structures, some of which are included in the .NET Framework Base Class Library and others that we'll build ourselves. If you're unfamiliar with the term, data structures are abstract structures, or classes, that are used to organize data and provide various operations upon their data. The most common and likely well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index.

Before jumping into the content for this article, let's first take a quick peek at the roadmap for this six-part article series, so that you can see what lies ahead. If there are any topics you think are missing from this outline, I invite you to e-mail me at mitchell@4guysfromrolla.com and share your thoughts. Space permitting, I'll be happy to add your suggestions to the appropriate installment or, if needed, add a seventh part to the series.

In this first part of the six-part series, we'll look at why data structures are important, and their effect on the performance of an algorithm. To determine a data structure's effect on performance, we'll need to examine how the various operations performed by a data structure can be rigorously analyzed. Finally, we'll turn our attention to two data structures present in the .NET Framework—the Array and ArrayList. Chances are you've used both of these data structures in past projects. In this article, we'll examine what operations they provide and the efficiency of these operations.

In Part 2, we'll explore the ArrayList class in more detail and examine its counterparts, the Queue class and Stack class. Like the ArrayList, both the Queue and Stack classes store a contiguous collection of data and are data structures available in the .NET Framework Base Class Library. However, unlike an ArrayList, from which you can retrieve any data item, Queues and Stacks only allow data to be accessed in a predetermined sequential order. We'll examine some applications of Queues and Stacks, and see how to implement both of these classes by extending the ArrayList class. After examining Queues and Stacks, we'll look at Hashtables, which allow for direct access like an ArrayList, but store data indexed by a string key.

While ArrayLists are ideal for directly accessing and storing contents, they are suboptimal candidates when the data needs to be searched. In Part 3, we'll examine the binary search tree data structure, which provides a much more efficient means for searching than the ArrayList. The .NET Framework does not include any built-in binary search tree data structures, so we will have to build our own.

The efficiency of searching a binary search tree is sensitive to the order in which the data was inserted into the tree. If the data was inserted in sorted or near-sorted order, the binary search tree loses virtually all of its efficiency advantages over the ArrayList. To combat this issue, in Part 4 we'll examine an interesting randomized data structure, the SkipList. SkipLists provide the search efficiency of a binary search tree, but without the sensitivity to the order in which data is entered.

In Part 5 we'll turn our attention to data structures that can be used to represent graphs. A graph is a collection of nodes, with a set of edges connecting the various nodes. For example, a map can be visualized as a graph, with cities as nodes and the highways between them as edges between the nodes. Many real-world problems can be abstractly defined in terms of graphs, thereby making graphs an often-used data structure.

Finally, in Part 6 we'll look at data structures to represent sets and disjoint sets. A set is an unordered collection of items. Disjoint sets are a collection of sets that have no elements in common with one another. Both sets and disjoint sets have many uses in everyday programs, which we'll examine in detail in this final part.

Analyzing the Performance of Data Structures

When thinking about a particular application or programming problem, many developers (myself included) find themselves most interested in writing the algorithm to tackle the problem at hand, or adding cool features to the application to enhance the user's experience. Rarely, if ever, will you hear someone excited about what type of data structure they are using. However, the data structures used for a particular algorithm can greatly impact its performance. A common example is finding an element in a data structure. With an array, this process takes time proportional to the number of elements in the array. With binary search trees or SkipLists, the time required is sub-linear. When searching large amounts of data, the data structure chosen can make a difference in the application's performance that can be visibly measured in seconds or even minutes.

Since the data structure used by an algorithm can greatly affect the algorithm's performance, it is important to have a rigorous method by which to compare the efficiency of various data structures. What we, as developers utilizing a data structure, are primarily interested in is how the data structure's performance changes as the amount of data stored increases. That is, for each new element stored by the data structure, how are the running times of the data structure's operations affected?

Consider a scenario in which you have a program that uses the System.IO.Directory.GetFiles(path) method to return the list of the files in a specified directory as a string array. Now, imagine that you wanted to search through the array to determine if an XML file existed in the list of files (namely one whose extension was .xml). One approach to do this would be to scan through the array and set some flag once an XML file was encountered. The code might look like so:

using System;
using System.IO;

public class MyClass
{
   public static void Main()
   {
      string[] fs = Directory.GetFiles(@"C:\Inetpub\wwwroot");
      bool foundXML = false;
      int i = 0;

      // Scan the array until the first file with an .xml extension is found
      // (the third argument makes the comparison case-insensitive)
      for (i = 0; i < fs.Length; i++)
         if (String.Compare(Path.GetExtension(fs[i]), ".xml", true) == 0)
         {
            foundXML = true;
            break;
         }

      // If the loop exited via break, i still indexes the matching file
      if (foundXML)
         Console.WriteLine("XML file found - " + fs[i]);
      else
         Console.WriteLine("No XML files found.");
   }
}

Here we see that in the worst case, when there is no XML file or the XML file is the last file in the list, we have to search through each element of the array exactly once. To analyze the array's efficiency at searching, we must ask ourselves the following: "Assume that I have an array with n elements. If I add another element, so the array has n + 1 elements, what is the new running time?" (The term running time, despite its name, does not measure the absolute time it takes the program to run; rather, it refers to the number of steps the program must perform to complete the task at hand. When working with arrays, the steps considered are typically the array accesses one needs to perform.) To search for a value in an array, we potentially need to visit every array value, so if we have n + 1 array elements, we might have to perform n + 1 checks. That is, the time it takes to search an array is linearly proportional to the number of elements in the array.
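To make the "n versus n + 1" question concrete, here is a minimal sketch that counts how many elements a linear search examines. The class and method names (LinearSearchDemo, CountChecks) are made up for illustration and are not part of the .NET Framework.

```csharp
using System;

public static class LinearSearchDemo
{
    // Returns how many array elements were examined while looking for target.
    public static int CountChecks(int[] values, int target)
    {
        int checks = 0;
        for (int i = 0; i < values.Length; i++)
        {
            checks++;                  // one array access per iteration
            if (values[i] == target)
                break;                 // stop at the first match
        }
        return checks;
    }

    public static void Main()
    {
        int[] n = new int[1000];          // every element defaults to 0
        int[] nPlusOne = new int[1001];

        // -1 is never present, so this is the worst case: every element is checked.
        Console.WriteLine(CountChecks(n, -1));          // 1000
        Console.WriteLine(CountChecks(nPlusOne, -1));   // 1001
    }
}
```

Adding one element adds exactly one check in the worst case, which is what "linearly proportional" means here.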

The sort of analysis described here is called asymptotic analysis, as it examines how the efficiency of a data structure changes as the data structure's size approaches infinity. The notation commonly used in asymptotic analysis is called big-Oh notation. The big-Oh notation to describe the performance of searching an array would be denoted as O(n). The large script O is where the terminology big-Oh notation comes from, and the n indicates that the number of steps required to search an array grows linearly as the size of the array grows.

A more methodical way of computing the asymptotic running time of a block of code is to follow these simple steps:

1. Determine the steps that constitute the algorithm's running time. As mentioned above, with arrays, the steps considered are typically the read and write accesses to the array. For other data structures the steps might differ. Typically, you want to concern yourself with steps that involve the data structure itself, and not simple, atomic operations performed by the computer. That is, with the block of code above, I analyzed its running time by only counting how many times the array needs to be accessed, and did not bother worrying about the time for creating and initializing variables or the check to see if the two strings were equal.

2. Find the line(s) of code that perform the steps you are interested in counting. Put a 1 next to each of those lines.

3. For each line with a 1 next to it, see if it is in a loop. If so, change the 1 to 1 times the maximum number of repetitions the loop may perform. If you have two or more nested loops, continue the multiplication for each loop.

4. Find the largest single term you have written down. This is the running time.

Let's apply these steps to the block of code above. We've already identified that the steps we're interested in are the array accesses. Moving on to step 2, note that there are two lines on which the array, fs, is accessed: as a parameter in the String.Compare() call and in the Console.WriteLine() call. Mark a 1 next to each line. Now, applying step 3, notice that the access to fs in the String.Compare() call occurs within a loop that runs at most n times (where n is the size of the array). So, scratch out the 1 in the loop and replace it with n. Finally, we see that the largest value is n, so the running time is denoted as O(n).
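Step 3 also covers nested loops, which the file-search example does not exercise. The following sketch is a hypothetical illustration (NestedLoopDemo and HasDuplicate are invented names): it deliberately checks every pair of elements without stopping early, so the counted line runs 1 × n × n times, giving O(n^2).

```csharp
using System;

public static class NestedLoopDemo
{
    // Reports whether any two distinct positions hold equal strings,
    // and counts how many times the innermost line executes.
    public static bool HasDuplicate(string[] items, out long comparisons)
    {
        comparisons = 0;
        bool found = false;
        for (int i = 0; i < items.Length; i++)          // runs n times
        {
            for (int j = 0; j < items.Length; j++)      // runs n times per outer pass
            {
                comparisons++;                          // the line marked "1", now 1 * n * n
                if (i != j && items[i] == items[j])
                    found = true;                       // no early exit, so the count stays n * n
            }
        }
        return found;
    }

    public static void Main()
    {
        string[] names = { "a.xml", "b.txt", "c.txt", "a.xml" };
        long comparisons;
        bool duplicate = HasDuplicate(names, out comparisons);
        Console.WriteLine(duplicate);      // True
        Console.WriteLine(comparisons);    // 16, that is 4 * 4
    }
}
```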

O(n), or linear time, represents just one of a myriad of possible asymptotic running times. Others include O(log_2 n), O(n log_2 n), O(n^2), O(2^n), and so on. Without getting into the gory mathematical details of big-Oh, the smaller the term inside the parentheses for large values of n, the better the performance of the data structure's operation. For example, an operation that runs in O(log n) is more efficient than one that runs in O(n), since log n < n.

Note   In case you need a quick mathematics refresher, log_a b = y is just another way to write a^y = b. So, log_2 4 = 2, since 2^2 = 4. Similarly, log_2 8 = 3, since 2^3 = 8. Clearly, log_2 n grows much slower than n alone, because when n = 8, log_2 n = 3. In Part 3 we'll examine binary search trees whose search operation provides an O(log_2 n) running time.
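One way to get a feel for how slowly a base-2 logarithm grows is to count how many times n can be halved before reaching 1. The sketch below uses integer halving instead of Math.Log to sidestep floating-point rounding, so it computes the logarithm rounded down; LogDemo and Log2Floor are illustrative names, not framework members.

```csharp
using System;

public static class LogDemo
{
    // Counts how many times n can be halved (integer division) before reaching 1.
    // For n >= 1 this equals the base-2 logarithm of n, rounded down.
    public static int Log2Floor(int n)
    {
        int steps = 0;
        while (n > 1)
        {
            n /= 2;
            steps++;
        }
        return steps;
    }

    public static void Main()
    {
        Console.WriteLine(Log2Floor(4));         // 2
        Console.WriteLine(Log2Floor(8));         // 3
        Console.WriteLine(Log2Floor(1024));      // 10
        Console.WriteLine(Log2Floor(1000000));   // 19
    }
}
```

Going from 8 elements to a million multiplies n by 125,000 but raises the step count only from 3 to 19.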

Throughout this article series, we'll be computing the asymptotic running time of each new data structure's operations and comparing it to the running time of similar operations on other data structures.

Everyone's Favorite Linear, Direct Access, Homogeneous Data Structure—The Array

Arrays are one of the simplest and most widely used data structures in computer programs. Arrays in any programming language all share a few common properties:

· The contents of an array are stored in contiguous memory.

· All of the elements of an array must be of the same type; hence arrays are referred to as homogeneous data structures.

· Array elements can be directly accessed. (This is not necessarily the case for many other data structures. For example, in Part 4 of this article series we'll examine a data structure called the SkipList. To access a particular element of a SkipList you must search through other elements until you find the element for which you're looking. With arrays, however, if you know you want to access the ith element, you can simply use one line of code: arrayName[i].)

The common operations performed on arrays are:

· Allocation

· Accessing

· Redimensioning

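When an array variable is first declared in C#, it does not yet refer to an array. That is, the following line of code simply creates a variable named booleanArray that holds no array (a field declared this way defaults to null, while a local variable must be assigned before it can be read):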

bool [] booleanArray;

Before we can begin to work with the array, we must allocate a specified number of elements. This is accomplished using the following syntax:

booleanArray = new bool[10];

Or more generically:

arrayName = new arrayType[allocationSize];

This allocates a contiguous block of memory in the CLR-managed heap large enough to hold the allocationSize number of arrayTypes. If arrayType is a value type, then allocationSize number of unboxed arrayType values are created. If arrayType is a reference type, then allocationSize number of arrayType references are created. (If you are unfamiliar with the difference between reference and value types and the managed heap versus the stack, check out Understanding .NET's Common Type System.)
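As a small sketch of what allocation leaves behind, the code below allocates one array of a value type and one of a reference type and inspects the freshly created elements. AllocationDemo and CountNulls are names invented for this example.

```csharp
using System;
using System.IO;

public static class AllocationDemo
{
    // Counts how many slots of a FileInfo array hold no reference yet.
    public static int CountNulls(FileInfo[] files)
    {
        int nulls = 0;
        foreach (FileInfo f in files)
            if (f == null)
                nulls++;
        return nulls;
    }

    public static void Main()
    {
        bool[] booleanArray = new bool[10];     // ten unboxed bool values
        FileInfo[] files = new FileInfo[10];    // ten references, no FileInfo objects yet

        Console.WriteLine(booleanArray[0]);     // False, the default bool value
        Console.WriteLine(CountNulls(files));   // 10, every reference starts as null
    }
}
```

Note that allocating the FileInfo array creates no FileInfo instances; only the slots that can later refer to them.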

To help hammer home how the .NET Framework stores the internals of an array, consider the following example:

bool [] booleanArray;
FileInfo [] files;

booleanArray = new bool[10];
files = new FileInfo[10];

Here, the booleanArray is an array of the value type System.Boolean, while the files array is an array of a reference type, System.IO.FileInfo. Figure 1 shows a depiction of the CLR-managed heap after these four lines of code have executed.

Figure 1. The contents of an array are laid out contiguously in the managed heap.

The thing to keep in mind is that the ten elements in the files array are references to FileInfo instances. Figure 2 hammers home this point, showing the memory layout if we assign some of the values in the files array to FileInfo instances.

Figure 2. The contents of an array are laid out contiguously in the managed heap.
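The situation described for Figure 2 can be sketched in code as follows. The file name is hypothetical (the FileInfo constructor does not require the file to exist), and ReferenceArrayDemo and BuildFiles are illustrative names.

```csharp
using System;
using System.IO;

public static class ReferenceArrayDemo
{
    // Builds a ten-element array and fills only the first two slots.
    public static FileInfo[] BuildFiles()
    {
        FileInfo[] files = new FileInfo[10];
        files[0] = new FileInfo("first.txt");   // hypothetical file name
        files[1] = files[0];                    // copies the reference, not the object
        return files;
    }

    public static void Main()
    {
        FileInfo[] files = BuildFiles();

        // Both slots refer to the same FileInfo instance on the heap.
        Console.WriteLine(Object.ReferenceEquals(files[0], files[1]));   // True

        // Slots that were never assigned still hold null.
        Console.WriteLine(files[2] == null);                             // True
    }
}
```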

All arrays in .NET allow their elements to be both read and written. The syntax for accessing an array element is:

// Read an array element
bool b = booleanArray[7];

// Write to an array element
booleanArray[0] = false;

The running time of an array access is denoted O(1) because it is constant. That is, regardless of how many elements are stored in the array, it takes the same amount of time to look up an element. This constant running time is possible solely because an array's elements are stored contiguously, hence a lookup just requires knowledge of the array's starting location in memory, the size of each array element, and the index of the element to be accessed.
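The arithmetic behind a constant-time lookup can be sketched as follows. This is a simplified model for illustration only: the addresses are made-up numbers, ElementAddress is a hypothetical helper, and the real CLR also stores array metadata and performs bounds checks that are ignored here.

```csharp
using System;

public static class ArrayAddressDemo
{
    // Simplified model: address of element i = start + i * elementSize.
    // One multiplication and one addition, regardless of the array's length.
    public static long ElementAddress(long startAddress, int elementSize, int index)
    {
        return startAddress + (long)index * elementSize;
    }

    public static void Main()
    {
        long start = 1000;     // made-up starting address
        int size = 4;          // assume 4-byte elements

        Console.WriteLine(ElementAddress(start, size, 0));   // 1000
        Console.WriteLine(ElementAddress(start, size, 7));   // 1028
    }
}
```

Because the computation never loops over the elements, the cost of locating element 7 is the same whether the array holds ten elements or ten million.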
