Text string processing upgrade

Completed. Posted: Jul 26, 2010. Paid on delivery.

I recently used VWorker to upgrade Delphi code (v 10) to handle Unicode strings. One program compiled from the development environment is Collocate, a working program that, among other things, gives n-gram frequency lists ("a lot of", "in the end", etc.). I want to add more options so that (a) variable slots can be included (e.g., "a * of") and (b) a concordance display of selected strings is provided. The source code contains files for n-grams, concordance, etc. A demo version of the software is attached.

## Deliverables

Collocate Upgrade -- Delphi 10

1. New commands

Under the Full Extract menu, I would like to add two more commands, P-frame and Lexical Bundles, so that we have:

    N-gram
    P-frame
    Lexical Bundles
    Full Extract

The two new commands are variants of N-gram.

(i) P-frame: This will have the same interface as N-gram, except that the minimum span is now 3, not 2. There is also a checkbox labeled "Internal variables only". A p-frame is the same as an n-gram but with a variable slot, so if the user selects a span of 3, then the software, rather than counting the trigram A B C, calculates the most common sequences A * C and presents them in a list in the usual way. A common p-frame will be "a * of". If "Internal variables only" is not checked, then the software will also calculate A B * and * B C and include the results in the one list along with A * C.

Let's assume "Internal variables only" and start with a trigram list. That means that we are looking for A * C. Here is a piece of a trigram list:

    going to be
    a lot of
    i think that
    one of the
    in terms of
    i don't know
    and i think
    we need to
    i think we
    you want to
    be able to
    i don't think
    are going to
    -- i mean
    we're going to
    we want to
    but i think
    we have to
    of the test
    he's going to

The first item is "going to be". We want to count all the instances of "going some-word be". (It is likely to be only one.) The next item is "a lot of", and so we are looking for "a some-word of". Even in this list we have two instances of that. We also have two instances of "we * to". I don't think there are any other p-frames in this list.

    2 a * of
    2 we * to

If "Internal variables only" is selected, then the list of the form A * C is returned (taking account of any minimum frequency threshold that is set). If I were doing the programming, I'd sort the list by position 1 and sort secondarily by position 3, and then iterate through the lines looking for a match in position 2. There may be more efficient ways of doing that.

If "Internal variables only" is not selected, then we are not finished. In this case I am also looking for * B C and A B *. For the * B C pattern we have

    3 * going to

For the A B * pattern we have

    2 I don't *

So the list that is returned will be something like:

    3 * going to
    2 a * of
    2 I don't *
    2 we * to

If the span is 4 and "Internal variables only" is checked, then the list will contain A * C D and A B * D. There won't be any Mutual Information option for the p-frame list.

(ii) Lexical Bundles: This also has the same interface as N-gram (i.e., Span), except for the following. There are two or three threshold values that transform an n-gram list into a lexical bundles list.

There is an option "Range". The user can determine how widespread the n-gram must be in order to be included in the lexical bundles list. We are doing this in a simple way by referring to the number of files. Say that there are 40 files in the corpus; then the user will be selecting some number less than or equal to 40. If the user enters 30, then the n-gram must occur in at least 30 separate files to be counted as a lexical bundle. We have the label "Range", a drop-down list to include a number, then "out of" followed by the total number of files in the corpus.

There is an option "Minimum Frequency". This will initially show whatever value is set in Counts Options. (There is just one setting of Minimum Frequency for the program, and there are two places where that value can be altered.)

There is an option "Mutual Information". This is itself optional, and so we need a checkbox next to it. If the checkbox is selected, then the n-grams must have an MI value equal to or greater than that value to be included in the Lexical Bundles list.

2. Colour highlighting

If the user does a search using Extract and uses + or >, then the words specified in the search may be separated by other words. Thus the user might type "test + result" (usually with the minimum frequency set to 1). The instances of the search terms (e.g., test and result) should show up in blue.

3. Copy option

The user should be able to copy and paste items from a table. This means that the user can select one or multiple lines and then use Control C or a Copy command in Display. The results are copied to the clipboard with a tab between fields.

4. Limited concordance display

For any of the results tables, the user can double-click on a line or a series of lines and get a concordance window of the results. For example, a 4-gram list may show that there are 43 instances of "one of the things". Selecting and then double-clicking the selection will bring up a concordance window showing the 43 instances. Here we will want to use some of the code from MonoConc, which is a concordancer in the same source code, so that the lines can be sorted and saved. Selecting several lines will also be possible. When the concordance window is active, we need another menu, Concordance. This contains commands for sorting, plus Save and Print commands, using the MonoConc code from the development environment.

A demo of Collocate is attached; use Load Corpus File in the File menu. This version works with ANSI text. The results are limited.
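
To make the p-frame counting in item 1(i) concrete, here is a minimal sketch in Python rather than Delphi, purely as an illustration. The function name `pframes`, its parameters, and the decision to sum n-gram frequencies per frame are assumptions, not part of the Collocate source. Each sample trigram is given a frequency of 1, so the tallies it produces need not match the figures quoted in the brief, which may reflect frequencies not shown in the list.

```python
from collections import Counter


def pframes(ngrams, internal_only=True, min_freq=1):
    """Count p-frames: n-grams with exactly one position replaced by '*'.

    ngrams        -- iterable of (tokens, frequency), tokens being a tuple of words
    internal_only -- if True, only interior positions may become the variable slot
    min_freq      -- frames with a lower total count are dropped
    (All names here are hypothetical; this is not the Collocate API.)
    """
    counts = Counter()
    for tokens, freq in ngrams:
        n = len(tokens)
        if n < 3:
            raise ValueError("p-frames need a span of at least 3")
        # Interior slots only (A * C), or every slot (* B C, A * C, A B *).
        slots = range(1, n - 1) if internal_only else range(n)
        for i in slots:
            frame = tokens[:i] + ("*",) + tokens[i + 1:]
            counts[frame] += freq
    kept = [(frame, c) for frame, c in counts.items() if c >= min_freq]
    # Most frequent first, then alphabetical, as in a frequency list.
    kept.sort(key=lambda fc: (-fc[1], fc[0]))
    return kept


# The trigram sample from the brief, each with an assumed frequency of 1.
TRIGRAMS = """going to be
a lot of
i think that
one of the
in terms of
i don't know
and i think
we need to
i think we
you want to
be able to
i don't think
are going to
-- i mean
we're going to
we want to
but i think
we have to
of the test
he's going to""".splitlines()

sample = [(tuple(line.split()), 1) for line in TRIGRAMS]

if __name__ == "__main__":
    for frame, count in pframes(sample, internal_only=False, min_freq=2):
        print(count, " ".join(frame))
```

A hash-based counter avoids the sort-and-scan pass suggested in the brief; either approach should give the same list, and the sorted-list approach may fit the existing n-gram code better.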
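
The Lexical Bundles thresholds in item 1(ii) can be sketched in the same way. Again this is an illustrative Python sketch under stated assumptions: `lexical_bundles`, `ngrams_of`, and the toy corpus are invented for the example, and the Mutual Information score is passed in as a caller-supplied function `mi` because the brief does not say how Collocate computes it.

```python
from collections import Counter


def ngrams_of(tokens, span):
    """Yield every contiguous n-gram of length `span` as a tuple."""
    return zip(*(tokens[i:] for i in range(span)))


def lexical_bundles(files, span, min_range, min_freq, min_mi=None, mi=None):
    """Filter n-grams by file range, frequency and (optionally) MI.

    files     -- dict mapping a file name to its list of tokens
    min_range -- minimum number of separate files the n-gram must occur in
    min_freq  -- minimum total frequency across the corpus
    min_mi    -- optional MI threshold; requires `mi`, a function n-gram -> score
    (All names are hypothetical; this is not the Collocate API.)
    """
    if min_mi is not None and mi is None:
        raise ValueError("an MI threshold needs an MI scoring function")
    freq = Counter()
    file_range = Counter()
    for tokens in files.values():
        grams = list(ngrams_of(tokens, span))
        freq.update(grams)             # total occurrences
        file_range.update(set(grams))  # one count per file containing the n-gram
    result = []
    for gram, f in freq.items():
        if f < min_freq or file_range[gram] < min_range:
            continue
        if min_mi is not None and mi(gram) < min_mi:
            continue
        result.append((gram, f, file_range[gram]))
    result.sort(key=lambda r: (-r[1], r[0]))
    return result


# Toy corpus, made up for illustration only.
corpus = {
    "a.txt": "one of the things i want is one of the best".split(),
    "b.txt": "that is one of the things we need".split(),
    "c.txt": "in terms of the test".split(),
}

if __name__ == "__main__":
    for gram, f, r in lexical_bundles(corpus, span=3, min_range=2, min_freq=2):
        print(f, " ".join(gram), "(in", r, "out of", len(corpus), "files)")
```

Counting range with a per-file set keeps an n-gram that repeats inside one file from inflating its range, which matches the "at least 30 separate files" wording.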

Delphi, Engineering, Microsoft, Project Management, Software Architecture, Software Testing, Windows Desktop

Project ID: #3602521

About the project

3 bids. Remote project. Last active: Sep 3, 2010

Awarded to:

MaximVolobuev

See private message.

$85 USD in 14 days
(79 reviews)
6.2

3 freelancers bid an average of $293 for this job

mdkass

See private message.

$283.05 USD in 14 days
(86 reviews)
6.9
dongcn

See private message.

$510 USD in 14 days
(0 reviews)
0.0