Tweak3D.Net
Save?
Not Registered Yet? Go here.
Xoxide Computer Mods Contact the webmaster to purchase this spot on the web site
Join The Cult Tweakers Image Gallery Donate Search Today's Posts Mark Forums Read

Welcome to Tweak3D

This is an open forum about tech stuff, games, and cars. Register to access all forums, and to hide this ad and others.

Reply
 
LinkBack Thread Tools Search this Thread Display Modes
Old 04-04-2007, 09:36 AM   permanent link to #1
 
JustinL's Avatar
JustinL
Tweak Minion
 
Join Date: Jan 2005
Posts: 1,519 (#65 )
Thanked 3x in 2 posts

This is the last assignment for my INTRO to c++ class. wtf is this guy smoking???


Quote:
Due Mon. Apr. 23, 2007, midnight (35 points). This program will be a dictionary based text file compressor. Most English text files never use the characters with ASCII codes 128 through 255. Therefore, you can replace the 128 most common words with one of these characters. The output will be shorter because these common words take only one byte of space.

The decoder will need to know how to translate these characters back to the original words. To do that, you have to store the dictionary in the compressed file. You can do that as a list of up to 128 words, one per line, representing the codes for 128, 129, ..., 255 followed by a blank line to mark the end.

Of course no space is saved by coding a one-character word like "a" or "I". Also, no space is saved by assigning a code to a word unless the word appears at least twice. If it appears only once, then any savings is lost by the extra dictionary entry. The best strategy is to rank the words by score = (length-1)*(count-1) and assign codes to the 128 highest scoring words that have a score of at least 1. It is possible the dictionary could be smaller than 128 words.

Your assignment is to write a file compressor (compress.cpp) that takes 3 arguments on the command line. The first argument is either c (to compress) or d (to decompress). The second argument is the input file. The third is the output file. For example:

compress c file.txt file.zzz

reads file.txt and creates file.zzz, which should be smaller than file.txt for most text files (like this web page). The command:
compress d file.zzz file.out

should create file.out, which will be identical to file.txt. In Windows you can check this with the FC command:
fc file.txt file.out
FC: no differences encountered

For example, if file.txt is:
This is a test.
This is another test.

file.zzz will contain:
This
test
is

(128) (130) a (129).
(128) (130) another (129).

where (128), (129), etc. are actually single bytes, which will look like strange characters in a text editor.
A word is any sequence of characters containing A-Z or a-z. Words are case sensitive. "This" and "this" are two different words.

For this assignment, you may assume that the input is a text file and the ASCII codes 128-255 never occur. You may also assume that lines of text end with a carriage-return and linefeed (if in Windows) including the last line. Your program does not have to work otherwise. The following compression approach is suggested:

Read the input file twice. In the first pass, store a list of word counts in a map<string, int>.
Rank the words by score and store in a vector<string>. Make 128 passes through the map to find the highest scoring word, push it on the vector and set the count to 0. Stop when the vector is size 128 or no more words have a positive score.
Print the vector and a blank line to the output.
Close and reopen the input file.
Output any nonletters directly.
For each word, search the vector. If found, output the corresponding code. If not found, output the word.
To decompress, read the dictionary (up to the blank line) into a vector<string>, and replace any characers from the rest of the file in the range 128-255 with the corresponding word.
Extra credit
For 15 points extra credit, make sure your program works on non-text files that already contain characters 128-255 (like .exe files or Unicode text files), and where the carriage-return and linefeed characters have no special meaning. Normally, C++ in Windows outputs "\n" as "\r\n" and reads "\r\n" as "\n" to maintain compatibility with UNIX/Linux text file format ("\r" is a carriage return - ASCII 13, and "\n" is a linefeed - ASCII 10). Binary mode prevents this extra processing:
ifstream in(filename, ios::in | ios::binary);
ofstream out(filename, ios::out | ios::binary);

Also, you will need to encode characters 128-255 so that they decode properly. You can "escape" them by using one of these codes (say 128) as an escape character. Thus, you would code (130) as (128)(130) to signal to the decompressor not to decode (130) from the dictionary. You would code (128) as (128)(128). Note that binary files with lots of these characters would get larger, not smaller.
You might compare your compressor with others on some benchmarks:
Maximum Compression benchmark
Large Text Benchmark and Hutter Prize
Calgary Corpus Challenge
The last two benchmarks offer prize money, but also require binary file compression. This compression scheme does not compress very well by itself, but has the interesting property that you can further compress the compressed file with one of these other compressors. In other words:

compress c file.txt file.zzz
zip file.zzz

will often produce a smaller file than
zip file.txt

Most other compressors do not have this property. Once you compress a file with one compressor, you cannot compress it further with that or any other compressor.

Democracy is two wolves and a lamb voting on what to have for lunch , Liberty is a well armed lamb contesting the vote. -Benjamin Franklin
JustinL is online now Ignore this User   Reply With Quote
Yesterday, 05:30 PM   #2
 
 


Google is online (hide)  
Old 04-04-2007, 10:45 AM   permanent link to #2
 
Lord Kain's Avatar
Lord Kain
Keeper of the Timeline
 
Join Date: Jan 2005
Location: Canada
Posts: 4,464 (#13 )
Thanked 3x in 3 posts

fun! you have to code your own .zip program!!!

yay

Lord Kain is offline Ignore this User   Reply With Quote
Old 04-04-2007, 11:13 AM   permanent link to #3
 
jake's Avatar
jake
Vagabond
 
Join Date: Jun 2005
Location: Calgary, AB
Posts: 2,322 (#42 )
Thanked 1x in one post

This is an entry level class? With no programming prerequisites? What school is this?

I'm trying to evolve.
jake is offline Ignore this User   Reply With Quote
Old 04-04-2007, 11:33 AM   permanent link to #4
 
Wedge_'s Avatar
Wedge_
Rogue One
 
Join Date: Jan 2005
Location: Scotland
Posts: 1,016 (#88 Last month's rank was 87)
Thanked 0x in 0 posts

To be fair, he's given you a lot of help, including a fairly detailed algorithm and the restrictions on the input format you have to deal with. I'd have enjoyed doing this in one of my CS courses actually, compression is an interesting topic. If this is a class for people with little/no programming experience, it is a bit much though.

Quote:
This compression scheme does not compress very well by itself, but has the interesting property that you can further compress the compressed file with one of these other compressors.
Quote:
Most other compressors do not have this property. Once you compress a file with one compressor, you cannot compress it further with that or any other compressor.
These two statements are linked - most other compressors don't have this property because they make more of an effort to compress the data in the first place. If you had a crap compressor that just went through text files and changed "two" to 2 or something, the output would could still be compressed much further. Since this one "does not compress very well" of course the output can still be compressed...

Per Ardua Ad Astra
Wedge_ is offline Ignore this User   Reply With Quote
Old 04-04-2007, 06:36 PM   permanent link to #5
esun
Member
 
Join Date: Jan 2005
Location: Berkeley, CA, US
Posts: 33 (#216 )
Thanked 0x in 0 posts
Send a message via AIM to esun

This requires absolutely no design, only implementation, meaning you should just have to look up some syntax. That's exactly what I'd expect from an introductory programming course. Perhaps it's a little long, but it certainly isn't hard.
esun is offline Ignore this User   Reply With Quote
Old 04-04-2007, 06:40 PM   permanent link to #6
 
Stevv's Avatar
Stevv
Senior Member
 
Join Date: Jan 2005
Posts: 222 (#149 )
Thanked 0x in 0 posts

Im in my second term of C++ and I've never gotten anything that hard, I could definitely complete it but file I/O is pretty tedious with C++, and that is a pretty detailed thing hes having you do.
Stevv is offline Ignore this User   Reply With Quote
Old 04-04-2007, 08:44 PM   permanent link to #7
chapel
Jolly Bolly Fo-Folly
 
Join Date: Jul 2006
Posts: 1,666 (#58 )
Thanked 1x in one post

Well as others have said, he outlined exactly what you should do, and gives a lot of information that otherwise probably wouldnt be given in harder courses. I could code that in any language based on that outline, well any that I knew. I think a project like this is good to learn about file i/o and str manipulation. It sounds daunting because its a "compressor" of sorts, but in reality would be fairly simple to throw out.

What someone should do, that is in a C++ class currently, or even CS in general, is print that out, and ask their programming teacher about it. Like if they think its hard for the fact that its a beginning programming class and other things. I think that would be some cool info, to see what other teachers think.
chapel is offline Ignore this User   Reply With Quote
Old 04-04-2007, 09:44 PM   permanent link to #8
 
JustinL's Avatar
JustinL
Tweak Minion
 
Join Date: Jan 2005
Posts: 1,519 (#65 )
Thanked 3x in 2 posts

My GSA looked at it today and he couldnt believe it. Now, he had us do a program simular to this already (theres a big thread on it somewhere), but this is an intro class, cse 1502. AFAIK, it is suppose to give the basics to allow me to use C++ for some type of engineering I may need to do.

Democracy is two wolves and a lamb voting on what to have for lunch , Liberty is a well armed lamb contesting the vote. -Benjamin Franklin
JustinL is online now Ignore this User   Reply With Quote
Old 04-04-2007, 11:29 PM   permanent link to #9
tex
jive turkey
 
Join Date: Jan 2005
Location: ATX
Posts: 2,296 (#44 )
Thanked 0x in 0 posts

Whenever I took my intro to c++ class, that was one of our assignments towards the middle of the semester. Our final one was rediculous, using those text-file based images and doing shit with star charts. The concept is easy, but actually implementing that kind of shit sucks (thus why I am not a programmer)
tex is online now Ignore this User   Reply With Quote
Old 04-04-2007, 11:47 PM   permanent link to #10
ThatHideousStrength
Tweak Minion
 
Join Date: Jan 2005
Posts: 4,542 (#12 )
Thanked 5x in 4 posts

I am a bit blown away from that assignment. I'm in a intro C++ course right now and haven't experienced something that hard yet.
ThatHideousStrength is offline Ignore this User   Reply With Quote
Old 04-04-2007, 11:57 PM   permanent link to #11
tex
jive turkey
 
Join Date: Jan 2005
Location: ATX
Posts: 2,296 (#44 )
Thanked 0x in 0 posts

Dunno, maybe all my programming teachers have been dicks. Our final project in the 2nd course ended being over 100 pages of code, we had one month to write and debug it. I ate shit because of the stupid way I did it...
tex is online now Ignore this User   Reply With Quote
Old 04-05-2007, 06:16 PM   permanent link to #12
 
SoulAssassin's Avatar
SoulAssassin
Vagina Friendly Mayor
 
Join Date: Jan 2005
Location: Cincinnati, OH
Posts: 3,425 (#25 )
Thanked 1x in one post
Send a message via Yahoo to SoulAssassin

I remember my first assignment in intro was about like that one above...it made me not want to ever program again since I had no knowledge of c++ at all before this.
SoulAssassin is offline Ignore this User   Reply With Quote
Reply

Login to reply. Sign-up here.



1 people are viewing this thread! (0 members and 1 guests)
 
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


All times are GMT -7. The time now is 05:08 PM.


Powered by vBulletin® Version 3.7.4
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
SEO by vBSEO 3.0.0
©Tweak3D.Net 1998-2008Ad Management by RedTyger