Importieren einer Excel-Datei mit griechischen Zeichen in R in der richtigen Codierung
Ich habe Probleme beim Importieren der folgenden Datei:http://www.kuleuven.be/bio/ento/temp/test.xlsx in R in der richtigen Codierung. Speziell,
library("xlsx")
read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
gibt mir
X1
1 a-cadinol
2 a-calacorene
3 a-caryophyllene alcohol
4 a-curcumene
5 a-elemol
6 a-muurolene
7 a-terpineol acetate
8 ß-4-dimethyl-3-cyclohexane-1-ethanol acetate
9 ß-bisabolene
10 ß-bisabolol
11 ß-bourbonene
12 ß-caryophyllene alcohol
13 ß-cyclocitral
14 ß-farnesol
15 ß-selinene
16 ß-sesquiphellandrene
17 <U+03B3>-cadinene
18 <U+03B3>-Carboethoxy-<U+03B3>-butyrolactone
19 <U+03B3>-ethyl-<U+03B3>-butyrolactone
20 <U+03B3>-eudesmol
21 <U+03B3>-muurolene
22 <U+03B3>-nonalactone
23 <U+03B3>-octalactone
24 <U+03B3>-selinene
25 <U+03B3>-undecalactone
26 d-cadinene
27 d-cadinol
28 d-muurolene
29 d-undecalactone
aber diea-
, <U+03B3>-
undd-
sollte seinalpha-
, gamma-
unddelta-
Überlegen Sie, wie ich meine Datei in der richtigen Codierung importieren kann?
Ich arbeite an Windows undiconvlist()
gibt mir
[1] "437" "850" "852" "855" "857"
[6] "860" "861" "862" "863" "865"
[11] "866" "869" "ANSI_X3.4-1968" "ANSI_X3.4-1986" "ASCII"
[16] "ASMO-708" "BIG-5" "BIG-FIVE" "big5" "BIG5"
[21] "big5-hkscs" "BIG5-HKSCS" "big5hkscs" "BIG5HKSCS" "CP-GR"
[26] "CP-IS" "cp1025" "CP1125" "CP1133" "CP1200"
[31] "CP12000" "CP12001" "CP1201" "CP1250" "CP1251"
[36] "CP1252" "CP1253" "CP1254" "CP1255" "CP1256"
[41] "CP1257" "CP1258" "CP1361" "CP154" "CP367"
[46] "CP437" "CP50221" "CP51932" "CP65001" "CP737"
[51] "CP775" "CP819" "CP850" "CP852" "CP853"
[56] "CP855" "CP857" "CP858" "CP860" "CP861"
[61] "CP862" "CP863" "CP864" "CP865" "cp866"
[66] "CP866" "CP869" "CP874" "cp875" "CP932"
[71] "CP936" "CP949" "CP950" "CSASCII" "CSIBM855"
[76] "CSIBM857" "CSIBM860" "CSIBM861" , "CSIBM863" "CSIBM864"
[81] "CSIBM865" "CSIBM866" "CSIBM869" "csISO2022JP" "CSISOLATIN1"
[86] "CSPC775BALTIC" "CSPC850MULTILINGUAL" "CSPC862LATINHEBREW" "CSPC8CODEPAGE437" "CSPCP852"
[91] "CSPTCP154" "CSWINDOWS31J" "CYRILLIC-ASIAN" "DOS-720" "DOS-862"
[96] "EUC-CN" "euc-jp" "euc-kr" "EUC-KR" "EUCCN"
[101] "eucjp" "euckr" "GB18030" "gb2312" "GBK"
[106] "hz-gb-2312" "IBM-CP1133" "IBM-Thai" "IBM00858" "IBM00924"
[111] "IBM01047" "IBM01140" "IBM01141" "IBM01142" "IBM01143"
[116] "IBM01144" "IBM01145" "IBM01146" "IBM01147" "IBM01148"
[121] "IBM01149" "IBM037" "IBM1026" "IBM273" "IBM277"
[126] "IBM278" "IBM280" "IBM284" "IBM285" "IBM290"
[131] "IBM297" "IBM367" "IBM420" "IBM423" "IBM424"
[136] "IBM437" "IBM437" "IBM500" "ibm737" "ibm775"
[141] "IBM775" "IBM819" "ibm850" "IBM850" "ibm852"
[146] "IBM852" "IBM855" "IBM855" "ibm857" "IBM857"
[151] "IBM860" "IBM860" "ibm861" "IBM861" "IBM862"
[156] "IBM863" "IBM863" "IBM864" "IBM864" "IBM865"
[161] "IBM865" "IBM866" "ibm869" "IBM869" "IBM870"
[166] "IBM871" "IBM880" "IBM905" "iso-2022-jp" "iso-2022-jp"
[171] "ISO-2022-JP" "ISO-2022-JP-MS" "iso-2022-kr" "ISO-8859-1" "iso-8859-13"
[176] "iso-8859-15" "iso-8859-2" "iso-8859-3" "iso-8859-4" "iso-8859-5"
[181] "iso-8859-6" "iso-8859-7" "iso-8859-8" "iso-8859-8-i" "iso-8859-9"
[186] "ISO-IR-100" "ISO-IR-6" "ISO_646.IRV:1991" "ISO_8859-1" "ISO_8859-1:1987"
[191] "ISO2022-JP" "ISO2022-JP-MS" "iso2022-kr" "ISO646-US" "iso8859-1"
[196] "ISO8859-1" "iso8859-13" "iso8859-15" "iso8859-2" "iso8859-3"
[201] "iso8859-4" "iso8859-5" "iso8859-6" "iso8859-7" "iso8859-8"
[206] "iso8859-8-i" "iso8859-9" "Johab" "JOHAB" "koi8-r"
[211] "koi8-u" "ks_c_5601-1987" "L1" "latin-9" "LATIN1"
[216] "latin2" "latin3" "latin4" "latin5" "latin7"
[221] "latin9" "mac" "mac-centraleurope" "mac-is" "macarabic"
[226] "maccentraleurope" "maccroatian" "maccyrillic" "macgreek" "machebrew"
[231] "maciceland" "macintosh" "macis" "macroman" "macromania"
[236] "macthai" "macturkish" "macukraine" "macukrainian" "MS-ANSI"
[241] "MS-ARAB" "MS-CYRL" "MS-EE" "MS-GREEK" "MS-HEBR"
[246] "MS-TURK" "MS50221" "MS51932" "MS932" "MS936"
[251] "PT154" "PTCP154" "SHIFFT_JIS" "SHIFFT_JIS-MS" "shift-jis"
[256] "shift_jis" "SJIS" "SJIS-MS" "SJIS-OPEN" "SJIS-WIN"
[261] "UCS-2" "UCS-2BE" "UCS-2LE" "UCS-4" "UCS-4BE"
[266] "UCS-4BE" "UCS-4LE" "UCS-4LE" "UCS2" "UCS2BE"
[271] "UCS2LE" "UCS4" "UCS4BE" "UCS4LE" "UHC"
[276] "unicodeFFFE" "US" "US-ASCII" "UTF-16" "UTF-16BE"
[281] "UTF-16LE" "UTF-32" "UTF-32BE" "UTF-32LE" "UTF-8"
[286] "UTF16" "UTF16BE" "UTF16LE" "UTF32" "UTF32BE"
[291] "UTF32LE" "UTF8" "WINBALTRIM" "windows-1250" "windows-1251"
[296] "windows-1252" "windows-1253" "windows-1254" "windows-1255" "windows-1256"
[301] "windows-1257" "windows-1258" "WINDOWS-31J" "WINDOWS-50221" "WINDOWS-51932"
[306] "windows-874" "WINDOWS-932" "WINDOWS-936" "x-Chinese_CNS" "x-cp20001"
[311] "x-cp20003" "x-cp20004" "x-cp20005" "x-cp20261" "x-cp20269"
[316] "x-cp20936" "x-cp20949" "x-cp50227" "x-EBCDIC-KoreanExtended" "x-Europa"
[321] "x-IA5" "x-IA5-German" "x-IA5-Norwegian" "x-IA5-Swedish" "x-iscii-as"
[326] "x-iscii-be" "x-iscii-de" "x-iscii-gu" "x-iscii-ka" "x-iscii-ma"
[331] "x-iscii-or" "x-iscii-pa" "x-iscii-ta" "x-iscii-te" "x-mac-arabic"
[336] "x-mac-ce" "x-mac-chinesesimp" "x-mac-chinesetrad" "x-mac-croatian" "x-mac-cyrillic"
[341] "x-mac-greek" "x-mac-hebrew" "x-mac-icelandic" "x-mac-japanese" "x-mac-korean"
[346] "x-mac-romanian" "x-mac-thai" "x-mac-turkish" "x-mac-ukrainian" "x_Chinese-Eten"
Ich habe es mit vielen versucht, ohne Erfolg ... Leider weiß ich auch nicht, in welcher Codierung Excel meine Datei gespeichert hat ...
Gibt es eine einfache Funktion in R, mit der ich alle griechischen Alpha-, Beta-, Gamma- und Delta-Werte (in der ursprünglichen Codierung) in "Alpha", "Beta", "Gamma" und "Delta" konvertieren kann (dh in geschrieben)? voll)? Oder umgekehrt, dh vollständig ausgeschriebene "alpha", "beta", "gamma" usw. in einzelne griechische Zeichen umwandeln?
EDIT: bezüglich meiner letzten frage habe ich es versucht
togreek=function(compname) {
n=as.character(compname,encoding="UTF-8")
n=gsub("alpha","\u03B1",n)
n=gsub("beta","\u03B2",n)
n=gsub("gamma","\u03B3",n)
n=gsub("delta","\u03B4",n)
n=gsub("epsilon","\u03B5",n)
n
}
tolatin=function(compname) {
n=as.character(compname,encoding="UTF-8")
n=gsub("\u03B1","alpha",n)
n=gsub("\u03B2","beta",n)
n=gsub("\u03B3","gamma",n)
n=gsub("\u03B4","delta",n)
n=gsub("\u03B5","epsilon",n)
n
}
Die Funktion Tolatin scheint zu funktionieren:
library("xlsx")
test=read.xlsx("test.xlsx",1,header=F,colClasses=c("character"),encoding="UTF-8")
tolatin(test$X1)
[1] "alpha-cadinol" "alpha-calacorene" "alpha-caryophyllene alcohol"
[4] "alpha-curcumene" "alpha-elemol" "alpha-muurolene"
[7] "alpha-terpineol acetate" "beta-4-dimethyl-3-cyclohexane-1-ethanol acetate" "beta-bisabolene"
[10] "beta-bisabolol" "beta-bourbonene" "beta-caryophyllene alcohol"
[13] "beta-cyclocitral" "beta-farnesol" "beta-selinene"
[16] "beta-sesquiphellandrene" "gamma-cadinene" "gamma-Carboethoxy-gamma-butyrolactone"
[19] "gamma-ethyl-gamma-butyrolactone" "gamma-eudesmol" "gamma-muurolene"
[22] "gamma-nonalactone" "gamma-octalactone" "gamma-selinene"
[25] "gamma-undecalactone" "delta-cadinene" "delta-cadinol"
[28] "delta-muurolene" "delta-undecalactone"
Aber wenn ich dann wieder in griechische Zeichen konvertiere, stoße ich wieder auf Probleme:
togreek(tolatin(test$X1))
[1] "α-cadinol" "α-calacorene" "α-caryophyllene alcohol"
[4] "α-curcumene" "α-elemol" "α-muurolene"
[7] "α-terpineol acetate" "ß-4-dimethyl-3-cyclohexane-1-ethanol acetate" "ß-bisabolene"
[10] "ß-bisabolol" "ß-bourbonene" "ß-caryophyllene alcohol"
[13] "ß-cyclocitral" "ß-farnesol" "ß-selinene"
[16] "ß-sesquiphellandrene" "<U+03B3>-cadinene" "<U+03B3>-Carboethoxy-<U+03B3>-butyrolactone"
[19] "<U+03B3>-ethyl-<U+03B3>-butyrolactone" "<U+03B3>-eudesmol" "<U+03B3>-muurolene"
[22] "<U+03B3>-nonalactone" "<U+03B3>-octalactone" "<U+03B3>-selinene"
[25] "<U+03B3>-undecalactone" "d-cadinene" "d-cadinol"
[28] "d-muurolene" "d-undecalactone"
Irgendwelche Gedanken, was ich falsch mache?