First off, after successfully adapting Google Refine to my company's needs, I've now come full circle and am adapting the software I wrote for the company to be open-source and usable by all. This task has led to me implementing server-side SQLite using the GWT framework (and the sqlitejdbc JAR). It's going surprisingly well. It's probably not the most secure idea in the world, but at least where I am, it'll be behind a log-in screen.
After this I may get involved in contributing to Google Refine's development. It depends on my company's needs, of course. All in all, this semester has been one of the most productive of my life.
Friday, December 16, 2011
Wednesday, December 7, 2011
Rebranding Google Refine
Google Refine is cool with you modifying their code and redistributing it... as long as you don't call it "Google Refine." So, I'm going through the process of rebranding Google Refine as BIORefine.
The first thing I did was whip up a nifty little replacement for the Google Refine logo.
It was created at two different heights, 30px and 40px. It goes in the grefine/main/webapp/modules/core/images folder.
grefine/main/webapp/modules/core/index.vt is edited to change Google Refine to BIORefine.
There are a handful of other files to be changed (mostly the .html files in grefine/main/webapp/modules/core/scripts/index/ such as create-project-ui-source-selection.html).
Edit: I'll probably update this guide as I find more that I've missed (like about.html - although I did leave all the text in, just added that BIORefine is a modified version of Google Refine).
Friday, December 2, 2011
Update
So, it looks like my proposal for my project will finally be approved soon. :-) Also the project is pretty much done. I have it good to go as far as Richard XXXX is concerned about generating sensible matches. Part of me would like to play and implement some sort of Needleman&Wunsch algorithm instead of edit distance, but... it's not a huge deal, and I don't think I'd get that much more out of it.
I look forward to seeing how it performs on Wednesday, when I'm on campus, and don't have to do the vpn-thing. It should perform much better - the cell names took the longest, but that's because there are thousands of them, and they're much more likely to have weird punctuation issues. Two hours to check a couple thousand, but then again, it beats checking them all by hand by a huge amount.
I took out the suggestion request code, it's just not something I see as valuable for the service to do. There's going to be enough overlap between tissue type and disease that it may be confusing and the user should know what they're looking for.
Tuesday, November 29, 2011
Updates and more thoughts
I added a nearest neighbor algorithm to find misspellings. I think I'm going to take out some of the pattern matching and just use the nearest neighbor scoring. It seems to give better results, and when terms are very short, it'll work better. (A misgiven id name of "C2" getting matched to a hundred different things that are very long isn't that useful.)
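The post doesn't show the actual scoring code, so here is only a minimal sketch of the idea, assuming plain Levenshtein edit distance divided by the longer string's length. The class and method names (NearestTerm, nearest, and so on) are made up for illustration. Normalizing by length is one way to keep a short id like "C2" from scoring well against long terms that merely contain it.

```java
import java.util.List;

// Hypothetical helper: nearest-neighbor lookup over a vocabulary list.
final class NearestTerm {

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 0.0 means identical, 1.0 means nothing lines up (assumed scoring). */
    static double normalizedDistance(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 0.0 : (double) editDistance(a, b) / longest;
    }

    /** Returns the vocabulary term closest to the query, ignoring case. */
    static String nearest(String query, List<String> vocabulary) {
        String best = null;
        double bestScore = Double.MAX_VALUE;
        String q = query.toLowerCase();
        for (String term : vocabulary) {
            double score = normalizedDistance(q, term.toLowerCase());
            if (score < bestScore) {
                bestScore = score;
                best = term;
            }
        }
        return best;
    }
}
```

Calling NearestTerm.nearest("HeLa cels", vocabulary) on a list that contains "HeLa cells" would pick that term, since only one insertion separates them.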
I want to find out if I can have the reconciliation happen in an autogenerated new column, for the purpose of building a synonym table.
And I really hate mold, I feel so sick.
Monday, November 21, 2011
Farewell to Freebase in Google Refine
To delete Freebase I:
- deleted extensions/freebase folder.
- edited extensions/build.xml to remove <ant dir="freebase/" target="build" />
- edited main/webapp/modules/core/scripts/reconciliation/recon-manager.js and removed
ReconciliationManager.customServices.push({
"name" : "Freebase Query-based Reconciliation",
"ui" : { "handler" : "ReconFreebaseQueryPanel" }
});
and changed:
ReconciliationManager.registerStandardService(
"http://4.standard-reconcile.dfhuynh.user.dev.freebaseapps.com/reconcile");
}
to:
ReconciliationManager.registerStandardService(
"http://my service/reconcile");
}
- edited main/webapp/modules/core/scripts/project/exporters.js to remove the MSQL and TripleWriter options.
Friday, November 18, 2011
Project is progressing well!
I've gotten Freebase disintegrated from Google Refine as much as possible for now. I want to explore a few last questions I have about some of the Freebase references that are coded into various classes.
I'm working on a reconciliation service configuration site that will allow users to add sources of data to the reconciliation service. It should be fun.
Wednesday, November 16, 2011
Milestone
I've reached a huge milestone. The service works with Google Refine now. This is excellent and fantastic and everything peachy because it enables me to run test cases so I can ensure it behaves as it should. Right now there's only one additional feature I want to add. It's done as far as I'm concerned.
Next: Take away Freebase extension, see if I can hard-code it to look for the company's service only.
Friday, November 11, 2011
Progress!
The single query works well. The multiple query is... just being difficult, but it should resolve itself soon. Next to do:
Add type-based help for matching.
Take out Freebase stuff. There's no reason why anyone needs to use the Freebase features at my company.
I should have this all done in a little less than a month's time. :-)
Wednesday, November 9, 2011
Reconciliation Service Tomcat Installation Notes
Note to self: class12.jar needs to be in the lib directory. After it's put in there tomcat needs to be restarted, so that it'll be added to the classpath.
Now to work on matching/scoring for the single item query.
Also - remember that multiple queries aren't working yet, so stop trying to run that!!!
I'm running on maybe 1 hr of sleep, so I'm a little loopy today. Yay, end-of-semester time combined with bad health that makes working on stuff harder (who doesn't love having a repetitive stress injury that is just getting worse by doing anything related to everything I need to do?).
Monday, November 7, 2011
Progress and To-Do
I have the GSON for the results working for the single query, kind of. Right now I just have it able to match a single result. I need to create a class that is an array of them.
I need to figure out how I want to handle multiple queries results (these are much more common in practical use).
Friday, November 4, 2011
JSON versus JSONP with a GSON example
JSON is a data format used for passing data between webapps.
An example of JSON
{
"name" : "Freebase Reconciliation Service",
"identifierSpace" : "http://rdf.freebase.com/ns/type.object.mid",
"schemaSpace" : "http://rdf.freebase.com/ns/type.object.id",
"view" : {
"url" : "http://www.freebase.com/view{{id}}"
},
"preview" : {
"url" : "http://www.freebase.com/widget/topic{{id}}?mode=content",
"width" : 430,
"height" : 300
},
"suggest" : {
"property" : "http://standard-reconcile.freebaseapps.com/suggest_property"
},
"defaultTypes" : [
{
"id" : "/people/person",
"name" : "Person"
},
{
"id" : "/location/location",
"name" : "Location"
}
]
}
JSON requests are normally restricted to the same domain: the browser's same-origin policy only lets a page fetch data from the site it came from, which protects against cross-site scripting attacks like the ones that were exploited like crazy when I was a young one just learning Java. But that's not always convenient. Sometimes you just need to get JSON-based data from a 3rd-party site (after all, JSON has advantages over other methods), so what to do?
JSONP was developed to solve this. JSONP looks almost entirely just like JSON, even in name. An example of JSONP where "foo" is added to the call in form of a callback value
(i.e. http://www.mydomain.com/mywebapp?callback=foo):
foo({
"name" : "Freebase Reconciliation Service",
"identifierSpace" : "http://rdf.freebase.com/ns/type.object.mid",
"schemaSpace" : "http://rdf.freebase.com/ns/type.object.id",
"view" : {
"url" : "http://www.freebase.com/view{{id}}"
},
"preview" : {
"url" : "http://www.freebase.com/widget/topic{{id}}?mode=content",
"width" : 430,
"height" : 300
},
"suggest" : {
"property" : "http://standard-reconcile.freebaseapps.com/suggest_property"
},
"defaultTypes" : [
{
"id" : "/people/person",
"name" : "Person"
},
{
"id" : "/location/location",
"name" : "Location"
}
]
})
So if you're like me, and think GSON is neat and want to use it, but need the data to be packaged in JSONP format, it's rather easy to fix this:
// Okay, it doesn't have to have the setPrettyPrinting,
// but it helps make it readable when troubleshooting!
Gson gson = new GsonBuilder().setPrettyPrinting().create();
response.setCharacterEncoding("UTF-8");
response.setContentType("application/json");
response.setStatus(200);
PrintWriter out = response.getWriter();
String JSONstr = gson.toJson(object2beJSONized);
// The callback name comes from the request, e.g. ?callback=foo
String callback = request.getParameter("callback");
// Wrap the JSON in the callback so it reads foo({...});
JSONstr = callback + "(" + JSONstr + ");";
out.print(JSONstr);
Ta-da~!
I'm posting this because I really wish there had been an explanation this easy somewhere online.
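One thing the snippet above glosses over is what happens when the callback parameter is missing or contains something nasty. Here is a rough sketch of the wrapping step on its own, using nothing but the standard library. JsonpWrapper and its method are hypothetical names, and the rule of only accepting plain (optionally dotted) JavaScript identifiers is my own assumption about what counts as a safe callback, not something from the GSON or Google Refine docs.

```java
import java.util.regex.Pattern;

// Hypothetical helper: turns a JSON string into JSONP when a callback is given.
final class JsonpWrapper {

    // Assumed rule: only plain, optionally dotted JavaScript identifiers are allowed.
    private static final Pattern SAFE_CALLBACK =
            Pattern.compile("[A-Za-z_$][A-Za-z0-9_$]*(\\.[A-Za-z_$][A-Za-z0-9_$]*)*");

    /**
     * Returns callback(json); when a callback name is supplied,
     * or the JSON unchanged when there is no callback at all.
     */
    static String wrap(String callback, String json) {
        if (callback == null || callback.isEmpty()) {
            return json; // plain JSON request, nothing to wrap
        }
        if (!SAFE_CALLBACK.matcher(callback).matches()) {
            // Refuse to echo arbitrary script text back to the browser.
            throw new IllegalArgumentException("Unsafe callback name: " + callback);
        }
        return callback + "(" + json + ");";
    }
}
```

In the servlet, the last two lines of the snippet above would then shrink to something like out.print(JsonpWrapper.wrap(callback, JSONstr)).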
To do for today
Get reconciliation service working - apparently Google Refine makes POST requests... I had it programmed to focus on GET, oops!
It looks like callbacks may be the issue, see how the vitro system handles them.
Google Refine 2.5 is nice, and I have the code checked out for it. It allows data from clipboard to be input, for some odd reason I think that it is a plus to have that.
Wednesday, November 2, 2011
More thoughts...
To do:
Get rid of namespace reconciliation feature - it's entirely Freebase-based.
Strip down features to what is useful per Rich (dif. between cell operations and column operations confusing).
Good news!
- Reconciliation service on tomcat is able to spit out metadata all happy-like. It shouldn't be too bad to extend, but I want to wait until everything else is done first.
Wednesday, October 26, 2011
New Ref
http://sourceforge.net/apps/mediawiki/vivo/index.php?title=Extending_Google_Refine_for_VIVO
--- this project is similar to mine, yet markedly different.
To do lists
To do:
Highest priority: implement other vocab rec. services.
2nd: take all the freebase options out of the GUI.
3rd: start migration over to Tomcat server.
Wish List:
Link matched vocab terms up to pages with information about the term (our definition)
Create some way for a request to be sent to the appropriate person when person adds new term?
VERY HIGH PRIORITY: Meet w/ end-users to develop not-self generated use-case.
Personal To do:
1.) Get Systems Core requirement figured out.
2.) Work on form.
3.) Meet with appropriate prof to finish paperwork.
4.) Print out papers regarding term projects.
Friday, October 21, 2011
Refining away
So, yeah, now my project is going to focus on Google Refine. I'm honestly kind of exhausted, already in burn out mode. I have too much to do for next week. Right now I just need to figure out why the flask library isn't able to be resolved by eclipse. Ugh!
Friday, October 14, 2011
That took long enough
GWTUpload is a go! The problem was the libraries weren't being included in WEB-INF/lib. I guess I need to figure out what eclipse is putting there and why. Or just double-check to make sure it's the right stuff. At any rate, off to create a custom upload servlet to call the parsing functions I wrote a couple weeks ago.
Wednesday, October 12, 2011
Weird Errors
So, I posted to the Google Group requesting help for the GWTUpload issues I've been having. Now I suspect that my code is alright, and that it's all Eclipse's fault. I'm updating my blog now because Eclipse decided that the library for GWT has gone "missing" even though it was right where it has been for as long as I've been working on this project.
I'm most of the way through doing the File Upload the less-fun way, but if I can get GWTUpload to work in the future, it's preferable. I like the little upload-tracker bar, but I can't really justify spending time writing that code myself.
Anyway, that would totally make sense of why the servlet functions weren't working.
To do:
- Meet with liaison for project
- Meet up with test subjects, make sure that the project will actually help them with their day-to-day lives
- finish this stupid backbone....
Here's the grand plan, at least as far as I see it:
Basically the front end is where users add data (right now xls, xlsx, csv- formatted data). That data goes through a parser (which is mostly done), and it is then loaded into the back end mini-database. For the user to see the data, there will be a JSON generator that takes the db-data and sends it to the front end in the format of JSON. The TBA is for the user-directed operations on the data.
Sunday, October 9, 2011
Errors
I have them. On the plus side, I now know about a language called "Go", which is only of passing interest. I don't quite see its place unless Google starts pushing it a lot more, and even then...
Saturday, October 8, 2011
Tomcat
So this weekend, I'm working on setting up tomcat so I can test the upload/parsing functionality. I finally got it to run. :-) Now for the fun part, configuration. I'm glad I have some experience with Apache webserver from my college days.
Friday, October 7, 2011
GWTUpload FTW!
Good news, I have something that resembles an upload doing something that should theoretically be working (right now it's spitting out a 404, but the smart people on the Internet say that's because I'm using Jetty, not Tomcat, so this weekend, I aim to fix that, so that I can do proper testing). Now I need to integrate my uploading of a file with something to check that it's the right type (which is easy enough) and then to go send it into a SQLite file. The proper one. I'm not sure how I want to manage it. For now I think it should be handled by session/user. But thinking realistically, by project makes more sense.
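For the "check that it's the right type" step, a minimal sketch could be as simple as looking at the file extension. This is only an assumption about how the check might work (UploadTypeCheck and SheetType are hypothetical names); it trusts the file name and doesn't inspect the file contents at all.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper: decides which parser an uploaded file should go to.
final class UploadTypeCheck {

    /** The three formats the parsers handle. */
    enum SheetType { XLS, XLSX, CSV }

    /** Maps a file name to a supported type, or empty if it isn't one. */
    static Optional<SheetType> detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "xls":
                return Optional.of(SheetType.XLS);
            case "xlsx":
                return Optional.of(SheetType.XLSX);
            case "csv":
                return Optional.of(SheetType.CSV);
            default:
                return Optional.empty(); // anything else gets rejected
        }
    }
}
```

The upload servlet could reject the request right away when detect returns empty, and otherwise hand the file to the matching parser.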
Wednesday, October 5, 2011
Parsers
I've finished with all my file parsers. I have code to parse xls, xlsx, and csv files. Now to work on the database part of the project. Things are going along swimmingly. :-) I'm probably going to look into posting my code online for the parsers, after I make it pretty. It's not the best-est super genius thing ever (I used POI and Open-CSV), but it is unique (as far as I can tell) in that it handles multiple sheets in a coherent fashion.
SQLite is next. I'm not sure if I want to have it as a separate jar, like the parsers. I probably should, to keep the server code to just interface stuff.
Actually, first is trying to fix the GWT code, it got jacked up somehow and isn't opening an internal jetty server anymore. I had tried to change it so I could just host it locally and set up a php script to do the file upload (it just works better, but I'm guessing I'm going to end up doing it in the hard way with Java because I like testing my code often).
TODO:
Long term:
-Wait for meeting with end users to be set up before I do too much UI work, find out if I can just pass that off to someone else.
-Look into the API I want to interface with.
Sunday, October 2, 2011
Friday, September 30, 2011
Update on project
After a lovely ~8 hrs of work, I'm thinking of working more at home tonight (but putting it as work I did tomorrow on my time-sheet, overtime pay is nice, but, I'd rather not annoy people).
Right now I found a work around for the file upload problem. Basically, while gwt has a super cute widget that's a "file uploader" it doesn't seem to be programmable to do much beyond look pretty and pull up an explorer style window. After searching online I found this guide. It's useful, but for the life of me, I can't get this guide to self-hosting the site (not through eclipse's default option) to work. From what I've seen of Java, it looks much more complicated and requires use of a servlet to upload a file to local directory.
I've got the PHP method theoretically working, well when I click submit, it looks for the script, and promptly dies as it "can't find it." This is something that won't be an issue when the program is done and I can throw it on a server. At least I think it won't be.
In other news, I've been exploring Apache POI. It looks pretty much ideal! It has functions for dealing with both .xls and .xlsx files! I'm not worried about parsing .csv files, as they are text files that, well, just aren't scary at all. I found a couple of good references.
I need to get the parsing done. I think an ArrayList matrix is the appropriate output; it can then be fed into a SQLite database without much pain. I learned my lesson from my Perl pseudo-prototype: having nice-looking code is great, but if you make it look too nice, you'll just end up with a slower program (AKA, I was sending the data line by line to the db, when I should've just had it all as one transaction). An important thing to remember with databases is that the bottleneck is how often you write to the hard drive (every commit turns writing on and off again), not the amount you're writing.
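To make the one-transaction lesson concrete, here is a rough JDBC sketch. The table name "cells" and its two columns are invented for the example, and BatchLoader is a hypothetical class; the only real API used is java.sql. It assumes the connection starts out in JDBC's default auto-commit mode, and it works with any JDBC driver, sqlitejdbc included.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: loads parsed rows into the database in one transaction.
final class BatchLoader {

    /**
     * Inserts every row and commits once at the end, instead of letting
     * auto-commit write to disk after each individual INSERT.
     * The "cells" table and its columns are made up for illustration.
     */
    static void insertAll(Connection conn, List<List<String>> rows) throws SQLException {
        conn.setAutoCommit(false); // start one big transaction
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO cells (name, tissue) VALUES (?, ?)")) {
            for (List<String> row : rows) {
                ps.setString(1, row.get(0));
                ps.setString(2, row.get(1));
                ps.addBatch(); // queue the row, nothing is written yet
            }
            ps.executeBatch(); // send all queued rows
            conn.commit();     // a single commit for the whole load
        } catch (SQLException e) {
            conn.rollback();   // leave the table untouched on failure
            throw e;
        } finally {
            conn.setAutoCommit(true); // back to the JDBC default
        }
    }
}
```

The ArrayList matrix from the parser can be passed straight in as the rows argument, as long as each inner list has at least two entries.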
I'm not terribly worried about the SQLite portion of the program; I'm becoming a SQL guru thanks to one of my classes. I've even programmed C# UDTs for fun (well, for extra credit, but still, I did it when I didn't have to).
Anyway, I have a lot to do, and probably have my hands in too many pots at once at this point. I just need to have something completely done!
Wednesday, September 28, 2011
Thesis Project
Done so far:
D/led code for google refine
have eclipse set up
Prototype UI
Found references for part of what I want to do
(dynamic tables, SQLite3, RPC)
To do:
Get Apache POI working
Get it to do the basics
Contact XXXX regarding Google Refine and Oracle DataSpaces info
Meet with XXXXX regarding what he thinks would be most useful for him
Contact XXXXXXX to find out who else does biocuration and get their input
Hack the heck out of Google Refine
Get info on Oracle Dataspaces
Wednesday, January 19, 2011
Registered!
I'm registered for classes (and have been for a while; I'm a lazy blogger).
I'm going to be taking:
Advanced Object-Oriented Design (mostly as a refresher)
Analysis of Algorithms II
Advanced Operating Systems (which I may drop for Data Mining, as I'm not confident enough with my C pointers and assembly).
All in all, I feel a mixture of elation and fear. Fear that I won't do well enough, happy that I got accepted into the program.
I began my undergraduate career as a Computer Science major. I did that for three years before I switched to biology. Biology ended up being fun, but not what I want to do career-wise. So, now I'm in graduate school studying Computer Science again.
Saturday, January 8, 2011
Thursday, January 6, 2011