Text Comparison - word frequencies calculation URGENT
$250-750 USD
Paid on delivery
Need it within a week...
I need to implement the tool described in the CollGram profile (Bestgen & Granger 2014) [login to view URL]. Its measures are the average t-score and MI score for all bigrams in a student's text, calculated against a reference text corpus.
I have a set of 300 essays written by students at different levels, and I would like to calculate the CollGram profiles for them based on the COCA corpus (example here: [login to view URL]), which I already have. The lists of bigrams and word frequencies are available on the website, so calculating the MI and t-score for each bigram should not be difficult in concept.
For each pair of words (bigram), two collocability scores (MI and t) should be calculated, based on the frequencies of the constituent words. The calculation from the formulas is mathematically simple. I am attaching the formulas as a *.docx file.
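As a rough sketch of that calculation, assuming the commonly used collocation formulas (the attached *.docx may define them slightly differently): with O the bigram's frequency in the reference corpus, f1 and f2 the frequencies of its two words, and N the corpus size, the expected frequency is E = f1 * f2 / N, MI = log2(O / E) and t = (O - E) / sqrt(O). The function name and the example counts are illustrative only, not taken from COCA.

```python
import math

def collocation_scores(bigram_freq, w1_freq, w2_freq, corpus_size):
    """Return (MI, t) for one bigram from reference-corpus counts.

    Assumed formulas (not confirmed by the attached document):
      E  = w1_freq * w2_freq / corpus_size
      MI = log2(bigram_freq / E)
      t  = (bigram_freq - E) / sqrt(bigram_freq)
    """
    if min(bigram_freq, w1_freq, w2_freq, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    expected = w1_freq * w2_freq / corpus_size
    mi = math.log2(bigram_freq / expected)
    t = (bigram_freq - expected) / math.sqrt(bigram_freq)
    return mi, t

# Made-up counts: the bigram occurs 100 times, each word 1,000 times,
# in a corpus of 1,000,000 words.
mi, t = collocation_scores(100, 1000, 1000, 1_000_000)
```

Bigrams that do not occur in the reference corpus have no score under these formulas, which is why they are reported separately.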
• I have around 300 texts, each 20-600 words long, written by students of English (the spelling has been corrected). All bigrams must be extracted from each text (punctuation marks act as bigram boundaries, so no bigram spans a punctuation mark).
• The extracted bigrams must be looked up in the reference list (COCA corpus), which is discussed in point (1). If they are found, their two collocability scores must be retrieved.
• For each text, four lists must be produced: the bigrams that were found (as tokens and as types), each with its two collocability scores, and the bigrams that were not found (as tokens and as types).
• For each text, the following values must be produced: the average of each of the two scores, for tokens and for types separately, and the percentage of bigrams not found out of the total number of bigrams (for tokens and for types).
• The final step should produce a clear summary table for the whole batch of texts.
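The steps in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CollGram implementation: tokenisation is a simple regex (any run of characters other than word characters, apostrophes and whitespace ends a segment), and `reference` is a hypothetical dictionary mapping each COCA bigram to a pre-computed (MI, t) pair.

```python
import re

def extract_bigrams(text):
    """Return all bigram tokens; no bigram spans a punctuation mark."""
    bigrams = []
    # Split the text into punctuation-free segments first.
    for segment in re.split(r"[^\w\s']+", text.lower()):
        words = segment.split()
        bigrams.extend(zip(words, words[1:]))
    return bigrams

def profile_text(text, reference):
    """Build the four lists and the per-text summary values.

    reference: dict mapping (w1, w2) -> (MI, t); hypothetical lookup table.
    """
    tokens = extract_bigrams(text)
    found = [b for b in tokens if b in reference]
    missing = [b for b in tokens if b not in reference]

    def summarise(found_items, missing_items):
        n = len(found_items)
        total = n + len(missing_items)
        return {
            "mean_MI": sum(reference[b][0] for b in found_items) / n if n else None,
            "mean_t": sum(reference[b][1] for b in found_items) / n if n else None,
            "pct_not_found": 100 * len(missing_items) / total if total else None,
        }

    return {
        "found_tokens": found,
        "found_types": sorted(set(found)),
        "missing_tokens": missing,
        "missing_types": sorted(set(missing)),
        "token_summary": summarise(found, missing),
        "type_summary": summarise(set(found), set(missing)),
    }

# Toy reference list with made-up scores.
reference = {("i", "like"): (2.0, 3.0), ("like", "tea"): (4.0, 5.0)}
profile = profile_text("I like tea, and you like tea.", reference)
```

The tokenisation should ultimately match whatever convention the reference bigram list uses (for example, how it treats contractions and hyphens); otherwise valid bigrams will be reported as not found.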
If really necessary, I can provide a [login to view URL] license for POS tagging.
I have a sample of the output for one analysed file:
1        2          3          4               5   6     7  8
         freq_text  freq_COCA  mean freq_COCA  MI  MI>3  t  t>2.54
bigram1
bigram2
bigram3
bigram4
bigram5
...
col 1 - a list of all bigrams retrieved from a learner text (without punctuation marks)
col 2 - frequency of the bigram in the learner text
col 3 - frequency of the bigram in COCA. If blank, the bigram does not occur in COCA.
col 4 - mean frequency of the bigram in COCA per 1 million words. If blank, the bigram does not occur in COCA.
col 5 - MI for the bigram, calculated from COCA. If blank, the bigram does not occur in COCA.
col 6 - "*" if MI>3
col 7 - t for the bigram, calculated from COCA. If blank, the bigram does not occur in COCA.
col 8 - "*" if t>2
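One possible way to build these eight columns is sketched below. It assumes a hypothetical `reference` dictionary mapping each COCA bigram to a tuple of (frequency, frequency per million, MI, t), and it writes tab-separated text so the result opens in a spreadsheet. The two cutoffs are parameters, because the sample header and the column notes show different values for t.

```python
import csv
import io
from collections import Counter

def header(mi_threshold, t_threshold):
    """Column labels for the eight output columns."""
    return ["bigram", "freq_text", "freq_COCA", "mean freq_COCA",
            "MI", f"MI>{mi_threshold:g}", "t", f"t>{t_threshold:g}"]

def build_rows(bigram_tokens, reference, mi_threshold=3.0, t_threshold=2.0):
    """One row per bigram type; COCA columns stay blank if not found."""
    rows = []
    for bigram, freq_text in Counter(bigram_tokens).items():
        row = [" ".join(bigram), freq_text]
        if bigram in reference:
            freq_coca, per_million, mi, t = reference[bigram]
            row += [freq_coca, per_million,
                    round(mi, 2), "*" if mi > mi_threshold else "",
                    round(t, 2), "*" if t > t_threshold else ""]
        else:
            row += [""] * 6  # bigram does not occur in the reference list
        rows.append(row)
    return rows

def to_tsv(rows, mi_threshold=3.0, t_threshold=2.0):
    """Serialise the header and rows as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header(mi_threshold, t_threshold))
    writer.writerows(rows)
    return buffer.getvalue()

# Toy data with made-up reference values.
tokens = [("of", "the"), ("of", "the"), ("blue", "zebra")]
reference = {("of", "the"): (1000, 2.5, 3.5, 1.5)}
print(to_tsv(build_rows(tokens, reference)))
```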
Input files may also contain tags in angle brackets (<...>), which should be removed.
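A minimal sketch of that clean-up step, assuming every tag sits between a `<` and the next `>` and never contains text that should be kept. Each tag is replaced with a space so that words on either side do not run together. The function name is illustrative.

```python
import re

TAG_PATTERN = re.compile(r"<[^<>]*>")

def strip_tags(text):
    """Remove <...> tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()

cleaned = strip_tags("<p>I like <b>green</b> tea.</p>")
```

Note that collapsing whitespace also removes line breaks, which is harmless if bigrams are extracted from the running text.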
I also want to be able to load multiple files. If I load more than one file into the program, I want the analysis described above for each file, plus cumulative results as:
This is only the beginning. I hope the freelancer who does this will be willing to continue developing the tool in follow-up projects.
Project ID: #10141081
About the project
8 freelancers bid an average of $480 for this job
Hi, we have a team of Data Mining and Web Scraping experts. We have worked on many Data Mining techniques including Association Rule Mining, Clustering, Outlier Mining, Sentiment Analysis etc. extensively in the pas...
Hi, client. I am a C++ programmer and mathematician. Please check my Profile/RecordList and tell me details. Looking forward to your response. Thanks.
Hi, I have over 5 years of experience in Java development. Here is my detailed plan: 1. The application will be a GUI application. 2. You will have the facility to select an individual file or a group of files or a...